[gdal-dev] GDAL raster processing: parallel computing


[gdal-dev] GDAL raster processing: parallel computing

Ari Jolma
Hi,

This relates to RFC 62, raster algebra.

I realized that parallel processing is really an essential element of
this. I don't have a lot of experience with parallel processing and
threads so please let me know if I'm writing silly or ignorant things.

James, in your emails you write that map and reduce functions are
essential. That seems to point to parallel processing. Can you
elaborate a bit more on your approach there? Are you using some
specific libraries, etc.?

Rutger mentioned Dask and Numba, which seem to be high-level solutions.

Anyway, I thought I'd give OpenMP a try with the C++ code I have
written so far. At a very simple level it seems that it might be
enough to add "#pragma omp parallel for" before the for loop that
iterates over the (cached) blocks, and then compile the code with
-fopenmp. Of course this does not work (or it seems to work, but does
not make the code use more than one CPU at the same time), since a
single GDAL Dataset object should not be used by several threads (GDAL FAQ).
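
To be concrete, here is roughly what I tried, in sketch form (simplified
and untested; the point is that all threads end up sharing one dataset
object):

    #include "gdal_priv.h"
    #include <vector>

    // Roughly what I tried (simplified): all threads share one
    // GDALDataset, which is exactly what the FAQ warns against.
    void naive_parallel(GDALDataset *ds)
    {
        GDALRasterBand *band = ds->GetRasterBand(1);
        int bx, by;
        band->GetBlockSize(&bx, &by);
        int nXBlocks = (band->GetXSize() + bx - 1) / bx;
        int nYBlocks = (band->GetYSize() + by - 1) / by;
    #pragma omp parallel for
        for (int i = 0; i < nXBlocks * nYBlocks; i++) {
            std::vector<GByte> buf((size_t)bx * by);
            // Unsafe: concurrent access to the same dataset's block cache.
            band->ReadBlock(i % nXBlocks, i / nXBlocks, buf.data());
            // ... process buf ...
        }
    }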

There seems to be a solution in a book, "Remote Sensing Raster
Programming", which I found with Google; Google Books shows the
relevant page. The book suggests adding #pragma omp barrier before
GDALRasterIO. To me it seems that that would cause all the raster data
to accumulate in RAM. I did not try it, though.

It seems that I should somehow make the code spawn a new Dataset
object for each thread. The function for that is GDALOpenShared. Now a
simple question: what if the raster is created in the code? My test
application is a simple one: it takes an existing raster and returns a
0/1 raster, where a cell has value 1 if the original raster has value
48, and 0 elsewhere. Is the solution to create the dataset, and then
open new connections to it using OpenShared?
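
For the input side, I imagine the per-thread pattern would be roughly
this (untested sketch; the file name is just an example):

    #include "gdal_priv.h"

    // Untested sketch: give each thread its own dataset object by
    // opening the file inside the parallel region.
    void per_thread_datasets(const char *pszFilename, int nBlocks)
    {
    #pragma omp parallel
        {
            GDALDataset *ds =
                (GDALDataset *)GDALOpen(pszFilename, GA_ReadOnly);
            GDALRasterBand *band = ds->GetRasterBand(1);
    #pragma omp for
            for (int i = 0; i < nBlocks; i++) {
                // ... read and process block i with this thread's band ...
                (void)band;
            }
            GDALClose(ds);
        }
    }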

By the way, I'll be at the FOSS4G code sprint Tuesday afternoon and
Saturday morning if anyone wants to discuss this.

Best,

Ari



Re: GDAL raster processing: parallel computing

Even Rouault
Hi Ari,

>
> This relates to RFC 62, raster algebra.
>
> I realized that parallel processing is really an essential element of
> this. I don't have a lot of experience with parallel processing and
> threads so please let me know if I'm writing silly or ignorant things.
>
> James, in your emails you write that map and reduce functions are
> essential. That seems to point to parallel processing. Can you
> elaborate a bit more on your approach there? Are you using some
> specific libraries, etc.?
>
> Rutger mentioned Dask and Numba, which seem to be high-level solutions.

There's also Spark, which is popular:
https://en.wikipedia.org/wiki/Apache_Spark

There's also MPI: https://en.wikipedia.org/wiki/Message_Passing_Interface

I have no direct experience with any of those, though. Happy to hear other
comments. But I'm not sure we would want to use one of those solutions
directly inside GDAL itself.

>
> Anyway, I thought I'd give OpenMP a try with the C++ code I have
> written so far. At a very simple level it seems that it might be
> enough to add "#pragma omp parallel for" before the for loop that
> iterates over the (cached) blocks, and then compile the code with
> -fopenmp. Of course this does not work (or it seems to work, but does
> not make the code use more than one CPU at the same time), since a
> single GDAL Dataset object should not be used by several threads (GDAL FAQ).

You should see some multi-threaded execution, though, with a likely crash
at some point once the dataset object is used by several threads.

>
> There seems to be a solution in a book, "Remote Sensing Raster
> Programming", which I found with Google; Google Books shows the
> relevant page. The book suggests adding #pragma omp barrier before
> GDALRasterIO. To me it seems that that would cause all the raster data
> to accumulate in RAM. I did not try it, though.

This should just put a mutex around the GDALRasterIO call. I'm not sure why
data would accumulate in RAM. I mean, you should declare the buffer that
receives the raster data to be thread-private; there's an OpenMP directive
for that.
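
In sketch form, what I mean is roughly (untested; edge-block clipping
omitted):

    #include "gdal_priv.h"
    #include <vector>

    // Untested sketch: serialize the I/O with an OpenMP critical section;
    // the buffer is thread-private simply because it is declared inside
    // the loop body.
    void read_blocks(GDALRasterBand *band, int bx, int by,
                     int nXBlocks, int nYBlocks)
    {
    #pragma omp parallel for
        for (int i = 0; i < nXBlocks * nYBlocks; i++) {
            std::vector<GByte> buf((size_t)bx * by); // thread-private
            int xoff = (i % nXBlocks) * bx;
            int yoff = (i / nXBlocks) * by;
    #pragma omp critical(gdal_io)
            { // only one thread at a time does I/O in here
                band->RasterIO(GF_Read, xoff, yoff, bx, by, buf.data(),
                               bx, by, GDT_Byte, 0, 0);
            }
            // ... process buf concurrently ...
        }
    }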

Although OpenMP looks like it should be easy to use, I personally find it a
bit hard to master and to be sure you are doing things right, because it is
easy to forget to declare that some variable must be thread-private. So until
now I've done multi-threading the old-school way, where you can more easily
see what is shared vs. thread-private. There's a port/cpl_worker_thread_pool.h
class I've used in a few places: the warp kernel, gdalgrid, and the GTiff
driver's multi-threaded compression. But this is mostly personal taste.
Someone experienced with OpenMP should manage to write correct code, and in a
less verbose way than with explicit multithreading.
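
From memory, the usage is roughly as follows; check the header for the
authoritative signatures, and note the job struct is made up for the
example:

    #include "cpl_worker_thread_pool.h"
    #include "gdal_priv.h"
    #include <vector>

    // Hypothetical per-block job payload.
    struct BlockJob { GDALRasterBand *band; int ix, iy; };

    static void ProcessBlock(void *pData)
    {
        BlockJob *job = static_cast<BlockJob *>(pData);
        // ... read/process one block; guard any shared state yourself ...
        (void)job;
    }

    void run_with_pool(std::vector<BlockJob> &jobs, int nThreads)
    {
        CPLWorkerThreadPool pool;
        pool.Setup(nThreads, nullptr, nullptr); // spawn the workers
        for (auto &job : jobs)
            pool.SubmitJob(ProcessBlock, &job); // queue one job per block
        pool.WaitCompletion();                  // wait for all jobs
    }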

>
> It seems that I should somehow make the code spawn a new Dataset object
> for each thread. The function for that is GDALOpenShared.

Not really. The GDALOpenShared("foo") use case is more for single-threaded
usage, where you sometimes need to open the same file from different places
and want to share the same dataset so as to take advantage of the cache. This
was the case for the old way the VRT driver managed its sources (think of an
RGB TIFF wrapped as a VRT: there are 3 sources pointing to the same TIFF
file).

In the multi-threaded case, GDALOpenShared() will return different objects if
called with the same dataset name from different threads, just like GDALOpen().

> Now a simple
> question: What if the raster is created in the code? My test application
> for this is a simple one, which takes an existing raster and returns a
> 0/1 raster, where the cell has 1 if the original raster has value 48 and
> 0 elsewhere. Is the solution to create the dataset, and then open new
> connections to it using OpenShared?

If it is an in-memory raster created with the MEM driver *and* you issue
RasterIO requests such that nXSize == nBufXSize && nYSize == nBufYSize (i.e.
no resampling), then you can share the same dataset object, since the
implementation doesn't require accessing blocks.
Otherwise you'll need different dataset objects, or a mutex around the
RasterIO calls.
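
A small untested sketch of that MEM case:

    #include "gdal_priv.h"

    // Untested sketch: one shared in-memory output dataset, written by
    // several threads on disjoint windows with no resampling
    // (nXSize == nBufXSize && nYSize == nBufYSize).
    GDALDataset *create_mem_output(int nXSize, int nYSize)
    {
        GDALDriver *drv = GetGDALDriverManager()->GetDriverByName("MEM");
        return drv->Create("", nXSize, nYSize, 1, GDT_Byte, nullptr);
    }

    // In each thread, for that thread's own window (xoff, yoff, w, h):
    //   band->RasterIO(GF_Write, xoff, yoff, w, h, buf, w, h,
    //                  GDT_Byte, 0, 0);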

>
> By the way, I'll be at the FOSS4G code sprint Tuesday afternoon and
> Saturday morning if anyone wants to discuss this.

I'll be there too.

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com

Re: GDAL raster processing: parallel computing

jramm
Hi, 
I wasn't talking about MapReduce per se (although that would also be very interesting), but rather the 'classic' functional programming functions 'map' and 'reduce'.

A 'map' function essentially takes some user-defined function and applies it to every element of a list/vector or some other iterable. For a raster dataset, I was proposing that it take a user-defined function or class instance and apply it to every 'block' of the dataset, with plenty of choices about how each block is supplied ('natural' blocks, user-defined sizes, overlaps or 'pixel' buffers, etc.).
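
In C++ terms I imagine something like the following sketch (the names are mine, and it is serial for clarity):

    #include "gdal_priv.h"
    #include <algorithm>
    #include <functional>
    #include <vector>

    // Sketch of a block-wise 'map': apply a user callback to every block
    // of a band, clipping edge blocks to the raster extent.
    void map_blocks(GDALRasterBand *in, GDALRasterBand *out,
                    std::function<void(std::vector<GByte> &)> fn)
    {
        int bx, by;
        in->GetBlockSize(&bx, &by);
        int nXB = (in->GetXSize() + bx - 1) / bx;
        int nYB = (in->GetYSize() + by - 1) / by;
        for (int j = 0; j < nYB; j++) {
            for (int i = 0; i < nXB; i++) {
                int w = std::min(bx, in->GetXSize() - i * bx);
                int h = std::min(by, in->GetYSize() - j * by);
                std::vector<GByte> buf((size_t)w * h);
                in->RasterIO(GF_Read, i * bx, j * by, w, h, buf.data(),
                             w, h, GDT_Byte, 0, 0);
                fn(buf); // user-defined per-block operation
                out->RasterIO(GF_Write, i * bx, j * by, w, h, buf.data(),
                              w, h, GDT_Byte, 0, 0);
            }
        }
    }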

A 'reduce' function applies a user-defined function cumulatively to the iterable, so as to reduce it to a single value. For raster datasets, then, a 'reduce' function would take in multiple datasets and pass the corresponding block of each dataset to the user-defined function together, expecting a single block to be returned, i.e. it reduces multiple datasets to a single dataset. Image mosaicking would be an example of this.
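
And a 'reduce' over datasets would look roughly like this (again my own sketch; it assumes all inputs are aligned and the same size as the output):

    #include "gdal_priv.h"
    #include <algorithm>
    #include <functional>
    #include <vector>

    // Sketch of a block-wise 'reduce': fold the corresponding block of
    // every input band into a single output block (e.g. mosaicking).
    void reduce_blocks(
        std::vector<GDALRasterBand *> &ins, GDALRasterBand *out,
        std::function<void(std::vector<std::vector<GByte> > &,
                           std::vector<GByte> &)> fold)
    {
        int bx, by;
        out->GetBlockSize(&bx, &by);
        int nXB = (out->GetXSize() + bx - 1) / bx;
        int nYB = (out->GetYSize() + by - 1) / by;
        for (int j = 0; j < nYB; j++) {
            for (int i = 0; i < nXB; i++) {
                int w = std::min(bx, out->GetXSize() - i * bx);
                int h = std::min(by, out->GetYSize() - j * by);
                std::vector<std::vector<GByte> > stack;
                for (GDALRasterBand *band : ins) { // same window per input
                    std::vector<GByte> buf((size_t)w * h);
                    band->RasterIO(GF_Read, i * bx, j * by, w, h,
                                   buf.data(), w, h, GDT_Byte, 0, 0);
                    stack.push_back(std::move(buf));
                }
                std::vector<GByte> result((size_t)w * h);
                fold(stack, result); // user-defined combine step
                out->RasterIO(GF_Write, i * bx, j * by, w, h,
                              result.data(), w, h, GDT_Byte, 0, 0);
            }
        }
    }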

I think 'map' and 'reduce' functions are all that is needed to let the user do pretty much any kind of processing they want, without having to worry about how to apply it to a raster dataset of any size and all that 'boring' boilerplate. Just define the algorithm and hand it over.

A MapReduce system would be, perhaps, a higher-level 'organiser' of how these tasks are run, perhaps distributing them across multiple machines.
Wikipedia does a much better job than I do of explaining MapReduce vs. map & reduce:

I started to gather my ideas and snippets of code in a repository here:

It is more a place to help me form ideas than anything usable at the moment.

I would be open to merging relevant parts etc. 

What I think is important is:

- Different kinds of block 'iterators', so you can support e.g. overlapping neighbourhoods, mosaicking, etc. Iterating in blocks is, of course, essential for handling rasters of any size.
- The ability to save state in the callbacks. I decided to support this by making the callback a class instance conforming to some interface rather than a function; I guess there are other ways too (see the sketch after this list).
- It would be good to support complex raster 'masks' as well (e.g. different regions of the mask indicating different processing parameters, or something). I am going to clarify my ideas on this on a wiki page in the above repo.
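
The interface I have in mind for the stateful callback is roughly (names hypothetical):

    #include "gdal.h" // GByte
    #include <vector>

    // Hypothetical interface for the stateful-callback idea: the
    // processing object keeps its own state across blocks.
    class BlockProcessor
    {
      public:
        virtual ~BlockProcessor() {}
        virtual void operator()(std::vector<GByte> &block) = 0;
    };

    // Example: a histogram that accumulates across all blocks.
    class HistogramCounter : public BlockProcessor
    {
        long counts[256] = {0}; // state preserved between calls
      public:
        void operator()(std::vector<GByte> &block) override
        {
            for (GByte v : block)
                counts[v]++;
        }
    };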

When I have tried parallel processing in the past, I have generally done it on the block array, which neatly avoids having multiple GDALDataset instances and possibly getting into a tangle with that. However, this limits the usefulness of parallel processing to only the most complicated algorithms, as most of the time I find processing to be heavily I/O-bound.
I was hoping at some point to come up with a parallel, block-based alternative to GDAL's Polygonize, as it runs quite slowly for large, tiled GeoTIFFs, where access by scanline is suboptimal. A polygonize function also seems sufficiently complex to benefit from parallelisation. OpenCV probably has much of the necessary work already done; of course, work relying on OpenCV might be better off outside GDAL, rather than introducing such a large and specialised dependency.

I'll be in Bonn from late Tuesday afternoon for the conference...

I have t




Re: GDAL raster processing: parallel computing

Tim Keitt



On Wed, Aug 17, 2016 at 3:26 PM, James Ramm <[hidden email]> wrote:
> When I have tried parallel processing in the past, I have generally done it on the block array, which neatly avoids having multiple GDALDataset instances and possibly getting into a tangle with that. However, this limits the usefulness of parallel processing to only the most complicated algorithms, as most of the time I find processing to be heavily I/O-bound.
> I was hoping at some point to come up with a parallel, block-based alternative to GDAL's Polygonize, as it runs quite slowly for large, tiled GeoTIFFs, where access by scanline is suboptimal. A polygonize function also seems sufficiently complex to benefit from parallelisation. OpenCV probably has much of the necessary work already done; of course, work relying on OpenCV might be better off outside GDAL, rather than introducing such a large and specialised dependency.

I would tend to agree. GDAL is really good at translation, but not such a great platform for implementing generic iterator-based algorithms. I would suggest using GDAL to translate, and then writing algorithms separately, where you could e.g. take advantage of newer C++ features. The Generic Image Library is one possibility.
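
For example, roughly (untested; by the Generic Image Library I mean Boost.GIL, and older Boost releases spell the header <boost/gil/gil_all.hpp>):

    #include "gdal_priv.h"
    #include <boost/gil.hpp>
    #include <vector>

    // Untested sketch: GDAL does the I/O, Boost.GIL runs the algorithm.
    // Whole-band read for brevity; uses Ari's 48 -> 0/1 example above.
    void process_with_gil(GDALRasterBand *band)
    {
        int w = band->GetXSize(), h = band->GetYSize();
        std::vector<unsigned char> buf((size_t)w * h);
        band->RasterIO(GF_Read, 0, 0, w, h, buf.data(), w, h,
                       GDT_Byte, 0, 0);

        namespace gil = boost::gil;
        auto view = gil::interleaved_view(
            w, h, reinterpret_cast<gil::gray8_pixel_t *>(buf.data()),
            (std::ptrdiff_t)w); // row stride in bytes
        gil::for_each_pixel(view, [](gil::gray8_pixel_t &p) {
            p[0] = (p[0] == 48) ? 1 : 0;
        });

        band->RasterIO(GF_Write, 0, 0, w, h, buf.data(), w, h,
                       GDT_Byte, 0, 0);
    }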

THK
 
