[gdal-dev] /vsicurl caching behavior

[gdal-dev] /vsicurl caching behavior

pvalsecc
Hi,

I've been looking at the performance of GDAL in the context of MapServer and QGIS accessing GeoTIFF files on an HTTP server (S3, mostly).

While doing that, I had a good look at cpl_vsil_curl.cpp and the surrounding code, which raised a few questions/concerns:

1) There are a lot of knobs that can be used to tune the thing that are not documented. For example CPL_VSIL_CURL_USE_CACHE. Is it on purpose?

2) The implementation of AddRegionToCacheDisk/GetRegionFromCacheDisk is quite crude. Scanning the file sequentially to find the region is not very efficient, I guess. Is there any plan to improve that? A slightly less crude approach would be to split the file in two: one file containing the index (hash+offset+size) and the other one the content. That way, scanning the index would be faster (contiguous on disk and in cache). But that requires the use of flock and its equivalents on other OSes.

3) The implementation of GetRegionFromCacheDisk has some efficiency problems. If the region is found, it calls AddRegion, which in turn calls AddRegionToCacheDisk just to re-scan the file, where it finds the region GetRegionFromCacheDisk just located and does not add it a second time. So we scan the file sequentially twice.

4) There is no limit on the size of the gdal_vsicurl_cache.bin file. This makes the disk cache not very usable: risk of running out of disk space, increasing slowness, no refresh of the data after some time.

5) There is no way to specify the location of gdal_vsicurl_cache.bin unless one does a chdir before calling GDAL.

6) If VSI_CACHE is enabled the data is cached twice in memory (papsRegions and VSICachedFile). Is it wanted?

7) If the file's content is modified, it's a total mess. We'll end up having portions of the file having the old data while the rest has the new data. I'm quite sure the GeoTiff we end up with won't be very valid.

8) In the case discussed in 7), CPL_VSIL_CURL_NON_CACHED will just purge the data from 1 of the 3 caches: papsRegions. The vsil_cache and the disk will still cache the content.
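
The two-file layout suggested in point 2 could look roughly like the following Python sketch. It is only an illustration of the idea, not GDAL code: the class name, the file names (cache.idx, cache.dat), the record layout and the use of a truncated SHA-256 as the hash are all assumptions, and the locking (flock or equivalent) that point 2 calls for is deliberately left out.

```python
import hashlib
import os
import struct

# Hypothetical fixed-size index record: 16-byte key hash, data offset, data size.
RECORD = struct.Struct("<16sQQ")


def region_key(url: str, offset: int) -> bytes:
    """Hash (url, offset) into a fixed-width key for the index."""
    return hashlib.sha256(f"{url}@{offset}".encode()).digest()[:16]


class SplitDiskCache:
    """Append-only cache: a small index file plus a separate content file."""

    def __init__(self, directory: str):
        self.index_path = os.path.join(directory, "cache.idx")
        self.data_path = os.path.join(directory, "cache.dat")
        for path in (self.index_path, self.data_path):
            open(path, "ab").close()  # create the file if it is missing

    def _find(self, key: bytes):
        # Only the compact index is scanned; the content file is not read here.
        with open(self.index_path, "rb") as idx:
            while True:
                raw = idx.read(RECORD.size)
                if len(raw) < RECORD.size:
                    return None
                stored_key, data_offset, size = RECORD.unpack(raw)
                if stored_key == key:
                    return data_offset, size

    def get(self, url: str, offset: int):
        hit = self._find(region_key(url, offset))
        if hit is None:
            return None
        data_offset, size = hit
        with open(self.data_path, "rb") as data:
            data.seek(data_offset)
            return data.read(size)

    def add(self, url: str, offset: int, payload: bytes) -> None:
        key = region_key(url, offset)
        if self._find(key) is not None:
            return  # region already cached
        # A real implementation would hold an exclusive lock around both writes.
        with open(self.data_path, "ab") as data:
            data_offset = data.seek(0, os.SEEK_END)
            data.write(payload)
        with open(self.index_path, "ab") as idx:
            idx.write(RECORD.pack(key, data_offset, len(payload)))
```

A lookup still scans linearly, but only over 32-byte index records instead of over the cached content itself.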

Apart from that, I'm very impressed by the performance GDAL can get when accessing the data through HTTP and how easy it is to understand the code.  Kudos!

What do you guys think?

Thanks and CU


_______________________________________________
gdal-dev mailing list
[hidden email]
https://lists.osgeo.org/mailman/listinfo/gdal-dev

Re: /vsicurl caching behavior

Even Rouault-2

Hi,

> 1) There are a lot of knobs that can be used to tune the thing that are not
> documented. For example CPL_VSIL_CURL_USE_CACHE. Is it on purpose?

Yes, the disk cache is an experiment that isn't used anywhere (from what I know) and is likely not in a finished state, as you noticed in the points below, which are all valid and should be addressed if someone wanted to make it production-ready.

> 6) If VSI_CACHE is enabled the data is cached twice in memory (papsRegions
> and VSICachedFile). Is it wanted?

The scopes of the caches are not the same. papsRegions is a global cache shared by all /vsicurl/ handles, and persistent (in memory) after they are closed (so that if the same filename is closed and re-opened in sequence, already read parts can be reused), whereas VSICachedFile is associated with a single file handle.

I guess there could be some optimizations to avoid those duplications, but that could substantially complicate the code, which is already non-trivial.
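
The difference in scope can be illustrated with a small Python model. This is a sketch under assumptions, not the actual implementation: GLOBAL_REGIONS stands in for papsRegions, the per-handle dictionary stands in for VSICachedFile, and the fetch callback is a hypothetical stand-in for the HTTP range request.

```python
from typing import Callable, Dict, Tuple

# Process-wide cache shared by every handle, keyed by (filename, chunk index).
# It plays the role that papsRegions has in the explanation above.
GLOBAL_REGIONS: Dict[Tuple[str, int], bytes] = {}


class Handle:
    """A file handle with its own private cache (the VSICachedFile role)."""

    def __init__(self, filename: str, fetch: Callable[[str, int], bytes]):
        self.filename = filename
        self._fetch = fetch  # hypothetical "download one chunk" callback
        self._local: Dict[int, bytes] = {}

    def read_chunk(self, index: int) -> bytes:
        if index in self._local:  # 1. per-handle cache
            return self._local[index]
        key = (self.filename, index)
        if key not in GLOBAL_REGIONS:  # 2. shared cache, else the network
            GLOBAL_REGIONS[key] = self._fetch(self.filename, index)
        self._local[index] = GLOBAL_REGIONS[key]
        return self._local[index]

    def close(self) -> None:
        # The per-handle cache dies with the handle; GLOBAL_REGIONS survives.
        self._local.clear()
```

Opening the same filename a second time gets a fresh, empty per-handle cache, but chunks already present in the shared cache are served without another fetch. The same chunk is held in both dictionaries while a handle is open, which is the duplication raised in point 6.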

 

> 7) If the file's content is modified, it's a total mess. We'll end up
> having portions of the file having the old data while the rest has the new
> data. I'm quite sure the GeoTiff we end up with won't be very valid.

Indeed. But the mess would also happen with no caching mechanism if a file is modified while being read. Even for a local file, GDAL uses the glibc FILE buffering API, so even if you modify some portions of a GeoTIFF that haven't been read yet, if you have already read nearby regions there's a chance you'll read old data in part.
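
The effect is easy to reproduce with any buffered reader. The following Python toy (the ChunkedReader class is made up for illustration and is unrelated to GDAL or glibc internals) buffers a "file" per chunk, rewrites the file between two reads, and ends up with content that matches neither version.

```python
class ChunkedReader:
    """Reads a mutable byte source through a simple per-chunk buffer."""

    def __init__(self, source: bytearray, chunk_size: int):
        self._source = source
        self._chunk_size = chunk_size
        self._buffered = {}

    def read_chunk(self, index: int) -> bytes:
        if index not in self._buffered:
            start = index * self._chunk_size
            end = start + self._chunk_size
            self._buffered[index] = bytes(self._source[start:end])
        return self._buffered[index]


remote = bytearray(b"AAAABBBB")      # version 1 of the "file"
reader = ChunkedReader(remote, chunk_size=4)
first = reader.read_chunk(0)         # buffers b"AAAA"

remote[:] = b"ccccdddd"              # the file is rewritten while open
second = reader.read_chunk(1)        # read after the rewrite: b"dddd"

mixed = reader.read_chunk(0) + second
# mixed == b"AAAAdddd": old first chunk, new second chunk
```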

 

> 8) In the case discussed in 7), CPL_VSIL_CURL_NON_CACHED will just purge
> the data from 1 of the 3 caches: papsRegions. The vsil_cache and the disk will
> still cache the content.

CPL_VSIL_CURL_NON_CACHED prevents the content of a file from being preserved in the papsRegions cache when a file handle is closed and re-opened. And VSICachedFile is only valid during the lifetime of the file handle. So I don't think there's an issue there. Perhaps the name CPL_VSIL_CURL_NON_CACHED is a bit misleading: there's always some cache (otherwise /vsicurl performance would be just too horrible), it is just that it doesn't survive file closing.
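
As a rough Python model of that behaviour: the shared cache is kept while handles are open and purged for the matching file when its handle is closed. The names here are hypothetical, and matching by filename prefix is an assumption of this sketch; how the option value is actually parsed and matched is not covered in this thread.

```python
from typing import Dict, Iterable, Tuple

# Shared in-memory cache keyed by (filename, offset); stands in for papsRegions.
SHARED_REGIONS: Dict[Tuple[str, int], bytes] = {}


def close_handle(filename: str, non_cached_prefixes: Iterable[str]) -> None:
    """Drop shared-cache entries for `filename` if it matches a non-cached prefix."""
    if not any(filename.startswith(p) for p in non_cached_prefixes):
        return  # default behaviour: regions stay cached for the next open
    for key in [k for k in SHARED_REGIONS if k[0] == filename]:
        del SHARED_REGIONS[key]
```

Reads made while the handle is open still hit the cache; only the state that would otherwise survive the close is discarded.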

 

Even

--

Spatialys - Geospatial professional services

http://www.spatialys.com



Re: /vsicurl caching behavior

pvalsecc
Hi Even,

Long time no see!

On Fri, Jan 5, 2018 at 12:21 PM, Even Rouault <[hidden email]> wrote:

> > 8) In the case discussed in 7), CPL_VSIL_CURL_NON_CACHED will just purge
> > the data from 1 of the 3 caches: papsRegions. The vsil_cache and the disk will
> > still cache the content.
>
> CPL_VSIL_CURL_NON_CACHED prevents the content of a file from being preserved in the papsRegions cache when a file handle is closed and re-opened. And VSICachedFile is only valid during the lifetime of the file handle. So I don't think there's an issue there. Perhaps the name CPL_VSIL_CURL_NON_CACHED is a bit misleading: there's always some cache (otherwise /vsicurl performance would be just too horrible), it is just that it doesn't survive file closing.


OK, I understand now the VSICachedFile lifecycle and it makes sense. But then the disk cache should not be used at all for files in the CPL_VSIL_CURL_NON_CACHED variable.
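
In other words, the lookup order would need a bypass for the disk tier. A minimal Python sketch of that rule follows; the dictionaries are stand-ins for the in-memory and on-disk caches, the fetch callback is a hypothetical network read, and prefix matching is an assumption of the sketch.

```python
from typing import Callable, Dict, Iterable, Tuple

Key = Tuple[str, int]  # (filename, offset)


def read_region(
    key: Key,
    memory: Dict[Key, bytes],
    disk: Dict[Key, bytes],          # stand-in for the on-disk cache file
    fetch: Callable[[Key], bytes],   # hypothetical network fetch
    non_cached_prefixes: Iterable[str],
) -> bytes:
    """Look up memory, then disk, then the network; skip disk for non-cached files."""
    if key in memory:
        return memory[key]
    use_disk = not any(key[0].startswith(p) for p in non_cached_prefixes)
    if use_disk and key in disk:
        data = disk[key]
    else:
        data = fetch(key)
        if use_disk:
            disk[key] = data  # only cacheable files are written to disk
    memory[key] = data
    return data
```

With this rule a file listed as non-cached is never read from or written to the disk tier, so a stale region left on disk cannot be served for it.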

I'll try to find some happy hacking time (or client funding, ideally) to improve the disk caching. What is the official GDAL procedure for PRs? I see a few in the github repo, but the doc says the source is located in SVN.

Thanks.


Re: /vsicurl caching behavior

Even Rouault-2

> OK, I understand now the VSICachedFile lifecycle and it makes sense. But
> then the disk cache should not be used at all for files in
> the CPL_VSIL_CURL_NON_CACHED variable.

Indeed. The disk cache code hasn't been touched/tested in years, and CPL_VSIL_CURL_NON_CACHED was added afterwards.

> I'll try to find some happy hacking time (or client funding, ideally) to
> improve the disk caching. What is the official GDAL procedure for PRs? I
> see a few in the github repo, but the doc says the source is located in SVN.

The master is still in SVN (we will likely switch to a full GitHub workflow at some point), but contributions sent as pull requests against the GitHub mirror make it easier for the person reviewing & merging them.

Even

--

Spatialys - Geospatial professional services

http://www.spatialys.com

