[gdal-dev] Cannot open S3 files after upload


Matt Hanson-2
Hello everyone,

My actual problem is a bit more specific than being unable to open S3 files after upload. Within the same Python session, I can open a file on S3 through /vsis3/, but if I then upload a new file that did not previously exist (using boto3), GDAL does not see it as a valid file. I originally encountered this problem in rasterio and gippy, but got the same result when using GDAL directly.

I have an app that generates time series by calculating values from images on S3; it also uploads files to S3 if they did not previously exist for a particular date. If all the files already exist, there is no problem and they can be read fine. However, if a file is missing *and* the app has already read a file from S3, then GDAL is unable to see the newly uploaded file as existing.

What appears to be happening is that once an S3 file is read, the contents of that bucket are read into a cache. If a new file is uploaded in the meantime, trying to read it consults the cache, doesn't find the file there, and throws an error. If I recall correctly, GDAL reads the other contents of that bucket/key-prefix because it's looking for accompanying metadata files, so is this listing cached in some way? It seemed like a plausible explanation, but I've been unable to find any reference to such a cache other than potentially VSI_CACHE; setting that to FALSE did nothing, and my understanding is that it applies to specific datasets, not bucket contents.

I've managed to replicate the problem in a very simple Python program below. While both files are uploaded without error (you can use gdalinfo remotely on both), the attempt to open the second file will throw:
ERROR 4: `/vsis3/pail-of-images/test2.tif' not recognized as a supported file format.

Calling the script a second time works, because (presumably) even though it uploads and overwrites both images again, they both exist from the beginning.

Either this is a bug or it's intended behavior, in which case there's hopefully some way to force GDAL to reread a bucket when trying to open a file. My current workaround is to change my app to upload all images before accessing any of them, but this seems unsatisfactory, not to mention that it wreaks havoc with my tests, which don't assume such behavior.

Suggestions very welcome, been banging my head on this for a couple days.

Tested with both Python v2.7 and 3.5, and with gdal 2.1.3 and gdal 2.2.0, with Docker, without Docker, and on both Ubuntu and OSX.

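########################
#!/usr/bin/env python3
# Minimal reproduction: upload each file to S3 with boto3, then immediately
# open it through GDAL's /vsis3/ handler in the same Python session.

from osgeo import gdal
import boto3

# Local GeoTIFFs; the same names are used as the S3 keys
filenames = [
    'file1.tif',
    'file2.tif'
]

bucket = 'pail-of-images'

s3 = boto3.resource('s3')
for f in filenames:
    print('Uploading %s to %s' % (f, bucket))
    s3.meta.client.upload_file(f, bucket, f)
    uri = '/vsis3/%s/%s' % (bucket, f)
    print('Opening %s' % uri)
    # Works for the first file; fails with ERROR 4 for the second one
    ds = gdal.Open(uri)
    print(ds.GetMetadata())
    ds = None  # close the dataset
##########################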

Matthew Hanson
Development Seed

_______________________________________________
gdal-dev mailing list
[hidden email]
https://lists.osgeo.org/mailman/listinfo/gdal-dev

Re: Cannot open S3 files after upload

Even Rouault-2

Matt,

> My actual problem is a bit more specific than being unable to open S3 files
> after upload. The actual problem is that within the same Python session, I
> can open a file off S3 with the vsis3 driver, but then if I upload a new
> file that previously did not exist (using boto3), gdal does not see it as a
> valid file.

Yes, I'm aware of that issue. /vsicurl/ and related file systems like /vsis3/ do cache both metadata (file size and date, directory listings) and data (chunks of files). /vsicurl/ was designed at a time when web resources didn't change much and changes during a single GDAL session were unlikely, but with cloud offerings this is no longer the case.
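To make the failure mode concrete, here is a toy model in plain Python of a directory listing that is cached on first access. It is only an illustration under simplifying assumptions: the class ToyCachedStore and its methods are hypothetical and are not GDAL code.

```python
class ToyCachedStore:
    """Toy stand-in for a remote store fronted by a directory-listing cache.

    Hypothetical illustration only; this is not how GDAL is implemented.
    """

    def __init__(self):
        self.remote = set()        # keys that really exist on the "server"
        self.listing_cache = None  # cached directory listing, filled lazily

    def upload(self, key):
        # Uploads go straight to the server and bypass the cache,
        # like boto3 writing to S3 without GDAL knowing about it.
        self.remote.add(key)

    def exists(self, key):
        if self.listing_cache is None:
            # First access: list the "bucket" once and remember the result.
            self.listing_cache = set(self.remote)
        return key in self.listing_cache

    def clear_cache(self):
        # Forget the cached listing so the next access lists again.
        self.listing_cache = None


store = ToyCachedStore()
store.upload("file1.tif")
first_seen = store.exists("file1.tif")   # the listing gets cached here
store.upload("file2.tif")                # uploaded after the listing was cached
stale_seen = store.exists("file2.tif")   # answered from the stale listing
store.clear_cache()
fresh_seen = store.exists("file2.tif")   # listing is rebuilt from the server

print(first_seen, stale_seen, fresh_seen)  # True False True
```

The second file exists on the "server" the whole time; it is only the cached listing that hides it until the cache is dropped.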

 

A few weeks ago I added to trunk a CPL_VSIL_CURL_NON_CACHED config option that can be set to disable caching for a file or set of files.

See https://trac.osgeo.org/gdal/wiki/ConfigOptions#CPL_VSIL_CURL_NON_CACHED

So in your example, setting CPL_VSIL_CURL_NON_CACHED=/vsis3/put_here_the_bucket_name will work.
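From Python, the option can be set before the first /vsis3/ file is opened. A minimal sketch, assuming the bucket name from the script in the original post and a GDAL build that includes this option; the try/except is only there so the snippet also runs where the GDAL bindings are not installed:

```python
import os

BUCKET = "pail-of-images"  # assumption: bucket name taken from the original script

# GDAL config options can also be given as environment variables.
non_cached = "/vsis3/%s" % BUCKET
os.environ["CPL_VSIL_CURL_NON_CACHED"] = non_cached

try:
    from osgeo import gdal
except ImportError:
    gdal = None  # bindings not installed; the environment variable is still set

if gdal is not None:
    # Same setting through the GDAL API
    gdal.SetConfigOption("CPL_VSIL_CURL_NON_CACHED", non_cached)
```

The same value can be exported in the shell before starting the process, which is what makes this route usable without touching application code.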

 

I've also just added, per https://trac.osgeo.org/gdal/ticket/6937, a new VSICurlClearCache() function (bound to SWIG as gdal.VSICurlClearCache()). So if you call gdal.VSICurlClearCache() just after the s3.meta.client.upload_file() call, that will work too.
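Applied to the script from the original post, the ordering would look roughly like the sketch below. The helper names upload_then_open and run_against_s3 are hypothetical; the upload, cache-clearing and open functions are passed in as arguments only so that the ordering is explicit and can be checked without S3 access.

```python
def upload_then_open(upload, clear_cache, open_dataset, local_path, bucket, key):
    """Upload a file, clear the /vsicurl/ caches, then open it via /vsis3/.

    In real use the three callables would be s3.meta.client.upload_file,
    gdal.VSICurlClearCache and gdal.Open.
    """
    upload(local_path, bucket, key)
    clear_cache()  # must run after the upload and before the open
    return open_dataset("/vsis3/%s/%s" % (bucket, key))


def run_against_s3():
    """Real wiring; needs boto3, S3 credentials, and a GDAL build that
    provides gdal.VSICurlClearCache(). Not called automatically."""
    import boto3
    from osgeo import gdal

    s3 = boto3.resource("s3")
    for name in ["file1.tif", "file2.tif"]:
        ds = upload_then_open(s3.meta.client.upload_file,
                              gdal.VSICurlClearCache,
                              gdal.Open,
                              name, "pail-of-images", name)
        print(ds.GetMetadata())
        ds = None  # close the dataset
```

Calling run_against_s3() reproduces the original loop with the cache cleared after every upload.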

 

Both mechanisms are complementary.

 

CPL_VSIL_CURL_NON_CACHED is useful in scenarios where you don't know when the server content can change (some other process or machine changes it behind your back). Its advantage is that it doesn't require any code modification (it was designed typically for the MapServer use case). Its drawback is that you lose all caching when the same file is opened, closed, opened, closed, ... several times during the process.

 

VSICurlClearCache() will give you more control if you control when uploads happen.

I've also backported VSICurlClearCache() to the 2.2 branch.

 

As far as VSI_CACHE=TRUE is concerned, its caching is scoped to a single VSI file handle instance. It can be useful if the global 16 MB vsicurl cache isn't big enough for very large files.

 

Even

 

--
Spatialys - Geospatial professional services
http://www.spatialys.com



Re: Cannot open S3 files after upload

Matt Hanson-2
Thanks Even,

Disabling reading of the directory is another workaround for my use case:
(GDAL_DISABLE_READDIR_ON_OPEN=TRUE)

On Wed, Jun 21, 2017 at 5:02 AM, Even Rouault <[hidden email]> wrote:
> [...]


Re: Cannot open S3 files after upload

Even Rouault-2

On Wednesday, 21 June 2017 11:47:49 CEST Matt Hanson wrote:
> Thanks Even,
>
> Disabling reading the directory is another workaround for my use case as
> well:
> (GDAL_DISABLE_READDIR_ON_OPEN=TRUE)

 

This can work, but it will cause GDAL to probe for lots of possible side-car files, one request at a time.

If you don't have any side-car files related to your .tif, try:

GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
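In Python this could be set as sketched below, on the assumption that the .tif files really have no side-car files (with EMPTY_DIR any that do exist are not looked for). The try/except only lets the snippet run where the GDAL bindings are missing.

```python
import os

# Assumption: the GeoTIFFs have no side-car files (.aux.xml, .ovr, ...),
# so GDAL may treat their directory as empty.
READDIR_MODE = "EMPTY_DIR"

# Config options can be passed as environment variables ...
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = READDIR_MODE

try:
    from osgeo import gdal
except ImportError:
    gdal = None  # bindings not installed; the environment variable is still set

if gdal is not None:
    # ... or set through the GDAL API before the first gdal.Open() call
    gdal.SetConfigOption("GDAL_DISABLE_READDIR_ON_OPEN", READDIR_MODE)
```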

 


 

 
