Parallel Support

Parallel Support

Paul Ramsey
FYI, I have a parallel query/aggregate branch going here:

https://github.com/pramsey/postgis/tree/parallel

I've marked most of the functions as PARALLEL SAFE, for better or worse.

Aggregates are frustrating: the one we probably most want to
parallelize, ST_Union, is quite tricky to do. Basically, we need to
get parallelism into the transfn stage, since by the time you get to
the combinefn or finalfn the result has already been returned to the
master. To get some work done in the transfn, I think we basically
need to run a union every N records, which means putting a bad magic
number in there, as well as washing out the benefits of cascaded
union.
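
To make the trade-off above concrete, here is a minimal Python sketch of the transfn/combinefn/finalfn flow with a union every N records in the transfn. It is a toy model, not PostGIS or PostgreSQL code: geometries are stood in for by frozensets of grid cells, set union stands in for the geometric union, and the names BATCH_SIZE, union_all and parallel_union are assumptions made up for this illustration.

```python
from functools import reduce

BATCH_SIZE = 3  # the "magic number" N; the value here is arbitrary


def union_all(geoms):
    """Union a batch in one call (stand-in for one cascaded union pass)."""
    return frozenset().union(*geoms)


def transfn(state, geom):
    """Runs in a worker: buffer inputs, union the buffer every N records."""
    merged, pending = state
    pending = pending + [geom]
    if len(pending) == BATCH_SIZE:
        merged = union_all([merged] + pending)
        pending = []
    return merged, pending


def combinefn(a, b):
    """Runs on the master: merge two partial states from workers."""
    return union_all([a[0], b[0]]), a[1] + b[1]


def finalfn(state):
    """Runs on the master: union whatever is still pending."""
    merged, pending = state
    return union_all([merged] + pending)


def parallel_union(rows, n_workers):
    empty = (frozenset(), [])
    # A round-robin split stands in for however rows reach the workers.
    chunks = [rows[i::n_workers] for i in range(n_workers)]
    partials = [reduce(transfn, chunk, empty) for chunk in chunks]
    return finalfn(reduce(combinefn, partials, empty))


# Ten overlapping two-cell "geometries" along a line.
rows = [frozenset({(i, 0), (i + 1, 0)}) for i in range(10)]
result = parallel_union(rows, n_workers=2)
```

In this model each worker unions small batches rather than its whole input in one pass, which is where the benefit of a single cascaded union gets washed out.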

You can still test a parallel union aggregate, though: the ST_MemUnion
aggregate is trivial to parallelize, and I have done so. I have also
done ST_Extent. ST_Collect doesn't benefit from parallelizing (since
it's mostly about memory copying).

For testing you'll probably end up messing with the parallel GUCs,
which are described here:

https://gist.github.com/pramsey/ff7cbf70dbe581189565

P.
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel

Re: Parallel Support

Rémi Cura
I've been watching the patch discussion on the postgres list, and
I must say this is a most wanted feature!
It is going to take PostGIS to another level in terms of scalability.

Currently I use Python for parallel queries, and it is a major hassle.
pgAdmin scripts are a little bit easier, but quite annoying as well.

Cheers,
Rémi-C

Re: Parallel Support

nicklas
In reply to this post by Paul Ramsey
Hi Paul

I saw your blog post about this and it made things clearer.

I don't know if I am missing something here or if I have a point.

As for the problem of combining parallelism with cascaded ST_Union, it
seems like a perfect fit from my naive view. I haven't looked at any
code and don't know how the tree is traversed.
But:

Suppose the master creates the tree. I don't know what the tree looks
like or whether we can manipulate it. But let's say the tree has 2
children per node, and we constrain the number of workers to a power
of 2. If we then want 4 workers, we go down the tree until we have
4 nodes horizontally, and distribute those to the workers.

This would mean the tree is built by the master in an initial
function, and then the transition functions (the workers) "walk the tree".

So, is it possible for transition functions to walk a tree, or do they
have to get all records defined in advance?
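
The idea of descending a binary tree until there is one node per worker can be sketched in Python as below. This only illustrates the proposal as worded here; it says nothing about whether PostgreSQL's aggregate machinery allows it. The Node class and the build and frontier helpers are hypothetical names, and plain integers stand in for geometries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    items: list                      # stand-in geometries under this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(items):
    """Build a balanced binary tree by repeatedly halving the item list."""
    if len(items) <= 1:
        return Node(items)
    mid = len(items) // 2
    return Node(items, build(items[:mid]), build(items[mid:]))


def frontier(root, n_workers):
    """Go down level by level until at least n_workers nodes sit side by side."""
    level = [root]
    while len(level) < n_workers:
        next_level = []
        for node in level:
            if node.left is None:    # a leaf cannot be split further
                next_level.append(node)
            else:
                next_level.extend([node.left, node.right])
        if len(next_level) == len(level):  # only leaves remain
            break
        level = next_level
    return level


tree = build(list(range(8)))
# One subtree per worker; each worker would union the items under its node.
work = [node.items for node in frontier(tree, n_workers=4)]
```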

Thanks

Nicklas




On Fri, 2016-03-25 at 12:20 -0700, Paul Ramsey wrote:

Re: Parallel Support

Paul Ramsey
No, that doesn't work. The way parallel aggregate works is to run the
transfns in the workers, so the very act of gathering results into the
initial set, pre-tree, happens in the workers. Then everything gets
passed to the combinefns on the master and then the finalfn happens
(and the cascade could happen at either the combine or the final stage).
There is no "second chance" to go and send the tree back for
parallelism (except by doing a threaded stage ourselves there,
which is entirely possible but outside the scope of pgsql
parallelism).

The best suggestion so far has been from Stephen Frost: allow the
workers to run their own "finalfn", or a "worker-side combine" as I
call it, so that we can cascade the sets first at the worker level,
then run one final combine on the master before returning.
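
A rough Python sketch of the "worker-side combine" idea follows. It is a toy model under loud assumptions: frozensets of grid cells stand in for geometries, set union stands in for the cascaded union, and worker_combine and run are hypothetical names, not PostgreSQL or PostGIS APIs. It compares how many geometries the master has to union with and without the worker-side step.

```python
def union_all(geoms):
    """Union a whole set in one call (stand-in for a cascaded union)."""
    return frozenset().union(*geoms)


def transfn(state, geom):
    """Worker: only gather inputs; no union work happens here."""
    return state + [geom]


def worker_combine(state):
    """Hypothetical worker-side step: cascade the worker's own set to one geometry."""
    return [union_all(state)]


def combinefn(a, b):
    """Master: concatenate the partial states received from workers."""
    return a + b


def finalfn(state):
    """Master: one final union over everything it received."""
    return union_all(state)


def run(rows, n_workers, worker_side):
    partials = []
    for i in range(n_workers):
        state = []
        for geom in rows[i::n_workers]:   # round-robin split, for illustration
            state = transfn(state, geom)
        if worker_side:
            state = worker_combine(state)
        partials.append(state)
    master_state = []
    for partial in partials:
        master_state = combinefn(master_state, partial)
    # Return the result and how many geometries the master had to union.
    return finalfn(master_state), len(master_state)


rows = [frozenset({(i, 0), (i + 1, 0)}) for i in range(12)]
plain, plain_inputs = run(rows, 4, worker_side=False)
cascaded, cascaded_inputs = run(rows, 4, worker_side=True)
```

Both runs give the same union, but with the worker-side step the master receives one geometry per worker instead of every input row.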

P.



On Tue, Mar 29, 2016 at 8:40 AM, Nicklas Avén
<[hidden email]> wrote:

Re: Parallel Support

Sandro Santilli-2
On Tue, Mar 29, 2016 at 08:46:20AM -0700, Paul Ramsey wrote:

> The best suggestion so far has been from Stephen Frost, to allow the
> workers to run their own "finalfn" or a "worker-side combine" as I
> call it, so that we can cascade the sets first at the worker level,
> then run one final combine on master before returning.

How about handling threads directly from within PostGIS,
or, in the CascadedUnion case, directly from within libgeos?

What do PostgreSQL threads add over custom threads?

--strk;

Re: Parallel Support

Paul Ramsey
PostgreSQL specifically doesn't do threading; it fires up background
workers on the fly.
The main thing we get from PostgreSQL is functionality "for free":
free parallel sequential scans, free parallel joins, etc.
If we start threading, we bang up against platform differences a little
harder than usual. I have a threading mode in the kmeans clustering,
but it's pthreads, which makes Windows compilation problematic. We
could pull in an abstraction layer *as well*, but then... There is also
the question of how many threads to use, which is a bit open, though I
guess now we could just lean on the max_parallel_workers GUC to get the
base database configuration info for that.
Basically, we haven't done threads on our side for historical reasons
of portability and complexity. Times have changed though, so maybe
it's a safer bet now.

P.


On Tue, Mar 29, 2016 at 9:44 AM, Sandro Santilli <[hidden email]> wrote:

Re: Parallel Support

Even Rouault-2
On Tuesday 29 March 2016 at 20:43:41, Paul Ramsey wrote:
> PostgreSQL specifically doesn't do threading, it fires up background
> workers on the fly.
> The main thing we get from PostgreSQL is functionality "for free".
> Free sequence parallel, free join parallel, etc.
> If we start threading we bang up against platform differences a little
> harder than usual. I have a threading mode in the kmeans clustering,
> but it's pthreads, which makes windows compilation problematic. We
> could pull in an abstraction layer *as well*,

There's GDAL's abstraction, callable from C, that could be used (in
cpl_multiproc.h). If you don't want to depend on GDAL just for that, you
could probably just borrow what is needed from port/cpl_multiproc.cpp.

--
Spatialys - Geospatial professional services
http://www.spatialys.com