Multicore Processing and Temporary File Cleanup


joechip90
Dear All,

I have looked around on other postings and it appears that the majority (if not all) of the GRASS libraries are NOT thread safe.  Unfortunately I have a very large processing job that would benefit from cluster processing.  I have written a script that can be run on multiple processors whilst being very careful not to allow different processes to try to modify the same data at any point.  The same raster file is not accessed by different processes at all in fact.

However, I also realise that alone might not solve all my problems.  In any one process some temporary files are created (by GRASS libraries) and then these are deleted on startup ("cleaning temporary files...").  Now I was wondering what these temporary files are, and whether there might be a problem with one process creating temporary files that it needs whilst another process starts up GRASS and deletes them.  Is there any way to call GRASS in a way that doesn't delete the temporary files?

I appreciate that I'm trying to do something that GRASS doesn't really support but I was hoping that it might be possible to fiddle around and find a way.  Any help would be gratefully received.

I have included the script that I'm trying to run below (the script will be run many times across multiple processors).  Any advice welcome:

example

Re: Multicore Processing and Temporary File Cleanup

Markus Neteler
Joseph,

I am using a cluster right now which is based on PBS to elaborate MODIS
satellite data. Some answers below:

On Feb 13, 2008 2:43 PM, joechip90 <[hidden email]> wrote:
>
> Dear All,
>
> I have looked around on other postings and it appears that the majority (if
> not all) of the GRASS libraries are NOT thread safe.

Yes, unfortunately true.

> Unfortunately I have a
> very large processing job that would benefit from cluster processing.  I
> have written a script that can be run on multiple processors whilst being
> very careful not to allow different processes to try to modify the same data
> at any point.  The same raster file is not accessed by different processes
> at all in fact.

Yes, fine. Essentially there are at least two approaches to "poor man's"
parallelization without modifying the GRASS source code:

- split map into spatial chunks (possibly with overlap to gain smooth results)
- time series: run each map elaboration on a different node.
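The first approach can be sketched with plain shell arithmetic; the extent, chunk count and overlap below are purely illustrative, and each node would feed its bounds to g.region before running the actual modules:

```shell
# Illustrative numbers: split a south..north extent of 0..4000 into
# 4 horizontal strips, each widened by a 100-unit overlap so results
# can be blended smoothly at the seams.
SOUTH=0; NORTH=4000; CHUNKS=4; OVERLAP=100
STEP=$(( (NORTH - SOUTH) / CHUNKS ))
i=0
while [ $i -lt $CHUNKS ]; do
    s=$(( SOUTH + i * STEP - OVERLAP ))
    if [ $s -lt $SOUTH ]; then s=$SOUTH; fi   # clamp to the map edge
    n=$(( SOUTH + (i + 1) * STEP + OVERLAP ))
    if [ $n -gt $NORTH ]; then n=$NORTH; fi
    echo "chunk $i: g.region n=$n s=$s"       # each node runs this g.region
    i=$(( i + 1 ))
done
```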

> However, I also realise that alone might not solve all my problems.  In any
> one process some temporary files are created (by GRASS libraries) and then
> these are deleted on startup (cleaning temporary files...).  Now I was
> wondering what these temporary files were and if there might be a problem
> with one process creating temporary files that it needs whilst another
> process starts up GRASS and deletes them.  Is there any way to call GRASS in
> a way that doesn't delete the temporary files?

You could just modify the start script and remove that call for "clean_temp".
BUT:
I am currently elaborating some thousand maps for the same region (time
series). I elaborate each map in the same location but a different mapset
(simply using the map name as mapset name). At the end of the elaboration I
call a second batch job which only contains g.copy to copy the result into a
common mapset. There is a small risk of a race condition here in case two
nodes finish at the same time, but even this can be trapped in a loop which
checks whether the target mapset is locked and, if needed, launches g.copy
again until it succeeds.
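Such a check-and-retry loop could be sketched as follows. The helper name, paths, and grass63 invocation are hypothetical; GRASS marks a mapset as in use with a .gislock file in the mapset directory:

```shell
# Hypothetical helper: wait until the target mapset is unlocked and
# the given command (e.g. the g.copy batch job) succeeds.
run_when_unlocked () {   # usage: run_when_unlocked <mapset_path> <command...>
    mapset=$1; shift
    until [ ! -f "$mapset/.gislock" ] && "$@"; do
        sleep 5   # mapset busy or job failed: wait and retry
    done
}

# On each node, after its own elaboration has finished (paths illustrative):
# export GRASS_BATCH_JOB=/shareddisk/gcopyjob.sh
# run_when_unlocked /shareddisk/grassdata/myloc/results \
#     grass63 -text /shareddisk/grassdata/myloc/results
```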

> I appreciate that I'm trying to do something that GRASS doesn't really
> support but I was hoping that it might be possible to fiddle around and find
> a way.  Any help would be gratefully received.

To some extent GRASS supports what you need.
I have drafted a related wiki page at:
http://grass.gdf-hannover.de/wiki/Parallel_GRASS_jobs

Feel free to hack that page!

Good luck,
Markus
_______________________________________________
grass-user mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-user

Re: Multicore Processing and Temporary File Cleanup

joechip90
Thank you Markus, your Wiki entry is most helpful.

It seems I need to make a few changes to my files and set up a large
number of mapsets in every location.  Is it appropriate then to have
multiple mapsets (one for each node) at a given location?  If so, is
there a way to automatically generate multiple mapsets in a given
location, such that I can jump straight into GRASS using a script along
the following lines in each of the processes (I will have thousands of
processes)?

#!/bin/bash

declare -r PROCESS_NUM=__ # Some allocated process number - $SGE_TASK_ID for Sun Grid

# Other non-GRASS commands here - in my script there is a call to an
# external database to download parameter values

grass62 -text database/location/${PROCESS_NUM}_mapset <<!
    # Some grass commands here
!

Each mapset would then contain the spatial data that each process
will use.  You suggest then copying the output into a single shared
mapset such as PERMANENT.  For my purposes I'll probably just save them
as text files (the data then gets transferred to another program for the
next stages of processing).

Again many thanks,


Re: Multicore Processing and Temporary File Cleanup

Markus Neteler
On Feb 13, 2008 7:46 PM, Joseph Chipperfield <[hidden email]> wrote:
> Thank you Markus your Wiki entry is most helpful.
>
> It seems I need to make a few changes to my files and set up a large
> number of mapsets in every location.  Is it appropriate then to have
> multiple mapsets (one for each node) at a given location?

Sure!

> If so is
> there a way to automatically generate multiple mapsets in a given
> location such that I can jump straight into GRASS using a script command
> along the following in each of the processes (I will have thousands of
> processes)?

Yes. When you start GRASS with the path to grassdata/location/mapset/ and
the mapset does not exist, it will be created automatically.
(hint for Hamish: this will then create a valid mapset, i.e. incl. DBF driver
 predefined - see grass-dev discussions)
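A minimal sketch of that, assuming Sun Grid Engine's $SGE_TASK_ID and purely illustrative paths:

```shell
#!/bin/bash
# Each array task derives its own mapset name; GRASS then creates
# the mapset on first entry if it does not exist yet.
PROCESS_NUM=$(printf '%04d' "${SGE_TASK_ID:-1}")   # zero-pad so listings sort
MAPSET="${PROCESS_NUM}_mapset"
echo "entering mapset $MAPSET"
# grass63 -text /shareddisk/grassdata/myloc/$MAPSET
```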

As a first step in your script, be sure to run
  g.mapsets add=mapset1_with_data[,mapset2_with_data]
to make the data to be elaborated accessible.

I am processing thousands of MODIS maps like that right now. GRASS is
launched as follows (in reality I loop over many map names, like
"aqua_lst1km20020706.LST_Night_1km.filt"):

------- snip -----------
MYMAPSET=$CURRMAP
TARGETMAPSET=results

export GRASS_BATCH_JOB=/shareddisk/modis_job.sh
grass63 -text /shareddisk/grassdata/myloc/$MYMAPSET

# copy over result to target mapset
export INMAP=${CURRMAP}_rst
export INMAPSET=$MYMAPSET
export OUTMAP=$INMAP
export GRASS_BATCH_JOB=/shareddisk/gcopyjob.sh
grass63 -text  /shareddisk/grassdata/myloc/$TARGETMAPSET
exit 0
------- snap ----------

You see that I run GRASS twice. Note that you need GRASS 6.3 to
make use of GRASS_BATCH_JOB (if set, GRASS automatically
executes that job instead of launching the user interface).

The script gcopyjob.sh simply contains
------- snip -----------
g.copy rast=$INMAP@$INMAPSET,$OUTMAP --o
------- snap ----------

That's it!

Your script suggestion is essentially right. Only, you had better get
a recent GRASS 6.3 to avoid a nightmare :)


> Each mapset would then contain the spatial data that each process
> will use.  You suggest then copying the output into a single shared
> mapset such as PERMANENT.  For my purposes I'll probably just save them
> as text files (the data then gets transferred to another program for the
> next stages of processing).

Sure - as you prefer. I put the elaborated MODIS maps into a single mapset
for easy takeaway in the end.

I have extended
http://grass.gdf-hannover.de/wiki/Parallel_GRASS_jobs

Cheers
Markus

--
Markus Neteler
Fondazione Mach  -  Centre for Alpine Ecology
38100 Viote del Monte Bondone (Trento), Italy
neteler AT cealp.it      http://www.cealp.it/



Re: Multicore Processing and Temporary File Cleanup

joechip90
Hi Markus,
 
Many thanks for all your help.  Thanks to your link to the wiki, I've managed to get this 'poor man's parallelisation' up and running on our university's cluster, and we've not had any problems thus far.
