[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell
Hi GDAL/OGR folks,

In my spare time I've been working on a vector file format called FlatGeobuf (tentatively). The main reason, besides curiosity and learning, I'm putting time into it is that I think shapefile still rules the read/query static data performance game, which is kind of sad, and probably one of the reasons it is still so widely popular. Also, the main competitor (GeoPackage) isn't suitable for streaming access (AFAIK), which shapefiles also handle surprisingly (?) well.

By using a performance-focused, write-once binary encoding (flatbuffers), constraining the schema, and optimizing for read/query performance by clustering features on an optimal spatial index (a packed Hilbert R-tree), I hope to beat shapefile performance and at the same time be optimal for streaming/cloud access.
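To make the clustering idea concrete, here is a minimal sketch of the Hilbert sort step (illustrative only, not the reference implementation; all names are made up): feature bounding boxes are ordered by the Hilbert value of their centers before the index is packed, so features that are close in space end up close together in the file.

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct BBox { double minx, miny, maxx, maxy; };

    // Distance of grid cell (x, y) along a Hilbert curve over an n x n grid
    // (n a power of two); classic iterative conversion.
    static uint64_t hilbertD(uint32_t n, uint32_t x, uint32_t y) {
        uint64_t d = 0;
        for (uint32_t s = n / 2; s > 0; s /= 2) {
            uint32_t rx = (x & s) > 0;
            uint32_t ry = (y & s) > 0;
            d += static_cast<uint64_t>(s) * s * ((3 * rx) ^ ry);
            // Rotate/flip the quadrant so the curve stays continuous.
            if (ry == 0) {
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                std::swap(x, y);
            }
        }
        return d;
    }

    // Sort feature bounding boxes by the Hilbert value of their center,
    // scaled onto the grid; this is the clustering step before packing.
    // The dataset extent is assumed to be non-degenerate.
    void hilbertSort(std::vector<BBox>& boxes, const BBox& extent) {
        const uint32_t n = 1u << 16;  // curve resolution
        const double w = extent.maxx - extent.minx;
        const double h = extent.maxy - extent.miny;
        auto key = [&](const BBox& b) {
            auto x = static_cast<uint32_t>(((b.minx + b.maxx) / 2 - extent.minx) / w * (n - 1));
            auto y = static_cast<uint32_t>(((b.miny + b.maxy) / 2 - extent.miny) / h * (n - 1));
            return hilbertD(n, x, y);
        };
        std::sort(boxes.begin(), boxes.end(),
                  [&](const BBox& a, const BBox& b) { return key(a) < key(b); });
    }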

I think I'm starting to get somewhere; more details and source are at https://github.com/bjornharrtell/flatgeobuf and I have an early proof of concept driver implementation at https://github.com/bjornharrtell/gdal/tree/flatgeobuf. Results are already quite promising - it can do roundtrip read/write and is already quite a bit faster than shapefile. I have also implemented a naive read-only QGIS provider for experimental purposes.

Basically I'm fishing for interest in this effort, hoping that others will see potential in it even if it's "yet another format" and far from final/stable yet. Any feedback is welcome. As I see it, GDAL is a good place for a format specification and reference implementation to incubate.

Some additional food for thought that I've had during the experimentation:

1. The main in-memory representations of geometry/coordinates seem to be either separate arrays per dimension (GDAL (partially?) and QGIS) or a single array with the dimension as stride. I've chosen the latter as of yet but I'm still a bit undecided. There is of course a substantial cost involved in transforming between the two representations, so the situation with two competing models is a bit unfortunate. I'm also not sure whether either of these models is objectively "better" than the other (a small sketch of the two layouts follows after these points).

2. One could get better space efficiency with protobuf instead of flatbuffers, but it has a performance cost. I have not gone far into investigating how much though and one could even reason about supporting both these encodings in a single format specification but it's taking it a bit far I think.

3. Is there a reason to support different packing strategies for the R-Tree or is Packed Hilbert a good compromise (besides it being beautifully simple to implement)?

4. FlatGeobuf is perhaps too technical a name, not catchy enough, and has a bad/boring abbreviation. Other candidates I've considered are OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any ideas? :)
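Regarding point 1, a minimal sketch of the two coordinate layouts and the conversion cost between them (plain C++ for illustration; the type and function names are made up):

    #include <cstddef>
    #include <vector>

    // Layout A: one array per dimension (roughly the GDAL/QGIS style).
    struct PlanarXY {
        std::vector<double> x;
        std::vector<double> y;
    };

    // Layout B: a single interleaved array with the dimension as stride: x0 y0 x1 y1 ...
    using InterleavedXY = std::vector<double>;

    // Converting between the two costs a full pass over all coordinates.
    InterleavedXY toInterleaved(const PlanarXY& p) {
        InterleavedXY out(p.x.size() * 2);
        for (std::size_t i = 0; i < p.x.size(); ++i) {
            out[2 * i] = p.x[i];
            out[2 * i + 1] = p.y[i];
        }
        return out;
    }

    PlanarXY toPlanar(const InterleavedXY& in) {
        PlanarXY out;
        out.x.resize(in.size() / 2);
        out.y.resize(in.size() / 2);
        for (std::size_t i = 0; i < out.x.size(); ++i) {
            out.x[i] = in[2 * i];
            out.y[i] = in[2 * i + 1];
        }
        return out;
    }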

Regards,

Björn Harrtell


Re: FlatGeobuf; proposal for a new performance oriented vector file format

Even Rouault
Björn,

> In my spare time I've been working on a vector file format called
> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
> I'm putting time into it is that I think shapefile still rules the
> read/query static data performance game, which is kind of sad, and probably
> one of the reasons it is still so widely popular. Also, the main competitor
> (GeoPackage) isn't suitable for streaming access (AFAIK)

I suspect that you could organize the layout of a sqlite3 file to be streaming friendly, but the sqlite3 lib itself is probably not ready for that (or you have to cheat by buffering a sufficiently large number of bytes in memory and use a custom VFS to read from that buffer; at that point, implementing your own dedicated sqlite3 reader might be better - likely doable, but not trivial). Random access is possible (you can use /vsicurl/ etc. on a GeoPackage), but it might involve a number of seeks and small reads in the B-Tree / R-Tree.

That raises a major question: which use case(s) do you want to address specifically? From what I've understood, for network access:
- streaming access for progressive display as bytes come in
- efficient bbox queries with minimized number of bytes and ranges in the
files to read (minimizing the number of ranges of bytes is probably the most
important criterion since reading 1 byte or 1000 will take the same time)

> I think I'm starting to get somewhere, more details and source are at
> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> concept driver implementation at
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> already quite promising - it can do roundtrip read/write and is already
> quite a bit faster than shapefile. I also have implemented a naive read-only
> QGIS provider for experimental purposes.

What are the I and O in the I+O related to the R-Tree index ?

Wondering if a 3D index could be an option in case you want to address the
full 3D case at some point. But might be something for later.

I'm not familiar with flatbuf, but is random access in the Feature table by feature index possible (without a preliminary scan pass), similar to the .shx file in a shapefile?

Just a detail: it could be nice to have some global flag in the header that
would mean "index of any feature = its FID, its FID - 1, or no particular
correlation between feature index and feature FID"

For completeness of the attribute types, you could have date, time and binary. Can the concept of a null value, an empty value, or both be encoded for a field value?

The Column structure could also receive more optional info: a long description, a maximum length for strings, and optional precision/scale formatting for those nostalgic of decimal formatting.

If you want full genericity to express a SRS, consider allowing a WKT CRS string as an alternative to authority+code.

>
> Basically I'm fishing for interest in this effort, hoping that others will
> see potential in it even if it's "yet another format" and far from
> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> place for a format specification and reference implementation to incubate.
>
> Some additional food for thought that I've had during the experimentation:
>
> 1. The main in memory representations of geometry/coordinates seem to be
> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
> single array with dimension as stride. I've chosen the latter as of yet but
> I'm still a bit undecided. There is of course a substantial cost involved in
> transforming between the two representations so the situation with two
> competing models is a bit unfortunate. I'm also not sure about which of
> these models are objectively "better" than the other?

Why not just use WKB encoding, since it has likely similar size and performance characteristics to the flatbuf encoding, with the main advantage of being widely implemented and supporting other geometry types like CircularString, PolyhedralSurface, etc., which you need if you want to fully compete with GeoPackage?
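For reference, WKB is simple at the primitive level; a minimal sketch of reading a 2D WKB Point (little-endian only, illustrative helper names) shows the shape of the encoding - byte order, geometry type, then raw doubles - though the nested multi-part types add more bookkeeping:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <stdexcept>

    struct Point { double x, y; };

    // A 2D WKB Point is 21 bytes: 1 byte byte-order flag, 4 byte geometry type,
    // then two IEEE doubles. Only the little-endian case is handled here.
    Point parseWkbPoint(const uint8_t* buf, std::size_t len) {
        if (len < 21) throw std::runtime_error("buffer too small");
        if (buf[0] != 1) throw std::runtime_error("only little-endian handled in this sketch");
        uint32_t type;
        std::memcpy(&type, buf + 1, 4);
        if (type != 1) throw std::runtime_error("not a WKB Point");
        Point p;
        std::memcpy(&p.x, buf + 5, 8);
        std::memcpy(&p.y, buf + 13, 8);
        return p;
    }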

>
> 2. One could get better space efficiency with protobuf instead of
> flatbuffers, but it has a performance cost. I have not gone far into
> investigating how much though and one could even reason about supporting
> both these encodings in a single format specification but it's taking it a
> bit far I think.

This contradicts a bit my above point, but if you decide on a custom geometry encoding, why not allow int32 vertex values with an offset+scale setting (à la OpenStreetMap PBF)?
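As an aside, a minimal sketch of the offset+scale idea (illustrative only, not part of any existing format): coordinates are stored as int32 after subtracting an offset and dividing by a resolution, and decoding reverses it, which is where rounding loss (the drift mentioned later in the thread) can creep in.

    #include <cmath>
    #include <cstdint>

    // Quantize a coordinate to int32 using an offset and a resolution (e.g. 1e-7 degrees).
    int32_t quantize(double v, double offset, double resolution) {
        return static_cast<int32_t>(std::llround((v - offset) / resolution));
    }

    // Decoding only recovers the value to within +/- resolution / 2.
    double dequantize(int32_t q, double offset, double resolution) {
        return offset + q * resolution;
    }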

If the main use case you want to address is cloud use, I was wondering if it would make sense to add a compression layer (ZSTD-compress each Feature, or a group of features?) to reduce the time spent waiting for data from the network. Or perhaps not, and just let the transport layer do the compression (HTTP Content-Encoding: gzip).

> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and has a
> bad/boring abbreviation. Other candidates I've considered are OpenGeoFile,
> OpenVectorFile or OpenSpatialFile but I'm undecided. Any ideas? :)

COF = Cloud Optimized Features ?

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com

Re: FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell

Thanks Even, answers inlined.

On Mon, 10 Dec 2018 at 13:19, Even Rouault <[hidden email]> wrote:
Björn,

> In my spare time I've been working on a vector file format called
> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
> I'm putting time into it is that I think shapefile still rules the
> read/query static data performance game, which is kind of sad, and probably
> one of the reasons it is still so widely popular. Also, the main competitor
> (GeoPackage) isn't suitable for streaming access (AFAIK)

I suspect that you could organize the layout of a sqlite3 file to be streaming friendly, but the sqlite3 lib itself is probably not ready for that (or you have to cheat by buffering a sufficiently large number of bytes in memory and use a custom VFS to read from that buffer; at that point, implementing your own dedicated sqlite3 reader might be better - likely doable, but not trivial). Random access is possible (you can use /vsicurl/ etc. on a GeoPackage), but it might involve a number of seeks and small reads in the B-Tree / R-Tree.

That raises a major question: which use case(s) do you want to address specifically? From what I've understood, for network access:
- streaming access for progressive display as bytes come in
- efficient bbox queries with minimized number of bytes and ranges in the
files to read (minimizing the number of ranges of bytes is probably the most
important criterion since reading 1 byte or 1000 will take the same time)

The use case I had in mind is not the lossy, render-optimized one; for that use case I think vector tiles are a good design. What I'm aiming for is essentially something to replace the shapefile, i.e. a lossless container of simple features with as good as possible performance for bbox query reads or full dataset reads via network or other means of I/O, without having to deal with preprocessing into tiles, generalization etc.
 
> I think I'm starting to get somewhere, more details and source are at
> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> concept driver implementation at
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> already quite promising - it can do roundtrip read/write and is already
> quite a bit faster than shapefile. I also have implemented a naive read-only
> QGIS provider for experimental purposes.

What are the I and O in the I+O related to the R-Tree index ?

I symbolizes the R-Tree nodes. O is a separate array with feature offsets (the byte offset of each feature), so that together you can quickly get the byte ranges that need to be fetched for a spatial query.
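A minimal sketch of how I and O could be combined at query time (illustrative names only, not the actual driver code; it assumes O carries one trailing entry for the total data size): the index search yields feature indices, and O turns each index into a byte range to fetch.

    #include <cstdint>
    #include <vector>

    struct ByteRange { uint64_t begin, end; };  // [begin, end)

    // featureOffsets ("O") holds the byte offset of each feature in the data
    // section, plus (assumed here) one trailing entry for the total size, so
    // feature i occupies [featureOffsets[i], featureOffsets[i + 1]).
    std::vector<ByteRange> toByteRanges(const std::vector<uint64_t>& hitIndices,
                                        const std::vector<uint64_t>& featureOffsets) {
        std::vector<ByteRange> ranges;
        ranges.reserve(hitIndices.size());
        for (uint64_t i : hitIndices)
            ranges.push_back({featureOffsets[i], featureOffsets[i + 1]});
        return ranges;
    }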
 
Wondering if a 3D index could be an option in case you want to address the
full 3D case at some point. But might be something for later.

I'm not familiar with flatbuf, but is random access in the Feature table by feature index possible (without a preliminary scan pass), similar to the .shx file in a shapefile?

That is the purpose of the O as explained above. Each feature is a separate flatbuffer message and can be accessed directly.
 
Just a detail: it could be nice to have some global flag in the header that
would mean "index of any feature = its FID, its FID - 1, or no particular
correlation between feature index and feature FID"

Not sure exactly what you mean, but I've considered having an optional FID index to support fast random access by FID in the cases where the FID is not the same as the index of the feature, and I guess what you are saying is that this should be explicit. This is not yet added to the spec.
 
For completeness of the attribute types, you could have date, time and binary. Can the concept of a null value, an empty value, or both be encoded for a field value?

Yes, perhaps it would be useful to have dedicated types for date, time and binary. I recently added datetime.

The columns/field definition is static for the layer. Each value is required to specify a column index. Null/missing values are represented by simply omitting values for those column indexes. An empty values array for a feature means all values are null.
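A minimal sketch of that representation (illustrative types only, not the actual FlatGeobuf schema; values simplified to strings): each feature carries a list of (column index, value) pairs, and a column absent from the list is null for that feature.

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    struct Value { uint16_t columnIndex; std::string value; };

    // A feature's values array; columns not listed here are null for this feature.
    using FeatureValues = std::vector<Value>;

    // Look up a column's value; std::nullopt means the value is null/missing.
    std::optional<std::string> getValue(const FeatureValues& values, uint16_t columnIndex) {
        for (const auto& v : values)
            if (v.columnIndex == columnIndex) return v.value;
        return std::nullopt;
    }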

The Column structure could also receive more optional info: a long description, a maximum length for strings, and optional precision/scale formatting for those nostalgic of decimal formatting.

Agreed.
 
If you want full genericity to express a SRS, consider allowing a WKT CRS string as an alternative to authority+code.

Agreed, I should consider it.
 
>
> Basically I'm fishing for interest in this effort, hoping that others will
> see potential in it even if it's "yet another format" and far from
> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> place for a format specification and reference implementation to incubate.
>
> Some additional food for thought that I've had during the experimentation:
>
> 1. The main in memory representations of geometry/coordinates seem to be
> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
> single array with dimension as stride. I've chosen the latter as of yet but
> I'm still a bit undecided. There is of course a substantial cost involved in
> transforming between the two representations so the situation with two
> competing models is a bit unfortunate. I'm also not sure about which of
> these models are objectively "better" than the other?

Why not just use WKB encoding, since it has likely similar size and performance characteristics to the flatbuf encoding, with the main advantage of being widely implemented and supporting other geometry types like CircularString, PolyhedralSurface, etc., which you need if you want to fully compete with GeoPackage?

I think I've (perhaps prematurely) ruled out WKB because I find it not very well/accessibly specified in its details and the existing implementations rather complex, so you might be right here. I'm however not sure about supporting any other geometry types than the ones I already do (similar to shapefile), to constrain the complexity.
 
>
> 2. One could get better space efficiency with protobuf instead of
> flatbuffers, but it has a performance cost. I have not gone far into
> investigating how much though and one could even reason about supporting
> both these encodings in a single format specification but it's taking it a
> bit far I think.

This contradicts a bit my above point, but if you decide on a custom geometry encoding, why not allow int32 vertex values with an offset+scale setting (à la OpenStreetMap PBF)?

That's what geobuf does for space efficiency, but it seems it can cause drift in some corner cases (see https://github.com/mapbox/geobuf/issues/96), so I decided not to dive into it, as I also wanted to try and aim for a zero-copy encoding.
 
If the main use case you want to address is cloud use, I was wondering if it would make sense to add a compression layer (ZSTD-compress each Feature, or a group of features?) to reduce the time spent waiting for data from the network. Or perhaps not, and just let the transport layer do the compression (HTTP Content-Encoding: gzip).

I have thought about it and container / transport layer compression seems preferable to me.
 
> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and has a
> bad/boring abbreviation. Other candidates I've considered are OpenGeoFile,
> OpenVectorFile or OpenSpatialFile but I'm undecided. Any ideas? :)

COF = Cloud Optimized Features ?

Hmm, not bad :) I haven't considered cloud use to be the main/only use case for the format, but also offline applications, so I'm not entirely convinced (yet).
 

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com

Re: FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell
Hi Benjamin,

Very interesting to read that you have experimented with similar things, and about your positive experiences with flatbuffers.

On Mon, 10 Dec 2018 at 18:18, Benjamin Stadin <[hidden email]> wrote:
Björn,

Interesting to see there is some progress in this area. We wrote a Flatbuffers-based vector tile schema about 2 1/2 years ago, which is used in our rendering product (a 3D map rendering engine). I can share some thoughts about the general pros and cons, and what I'd do differently (from today's perspective).

Just last week I also started experimenting with using our tile format within QGIS, which would work similarly to the MVT QGIS plugin [1] which uses web Mercator tiling. It will parse our "HSG" flatbuffers cells (aka tiles), convert them on the fly into GeoJSON and pass them to QGIS.

Some thoughts about your current implementation:

1) I second Even's question about the actual use-case. I think a traditional tile-based approach will be more efficient for bbox querying, and client-side visualization. I'll share my approach later below.

I agree, and perhaps the spatial index part needs more consideration, but it's an important point that the format I'm trying to make is not something to compete with vector tiles and render-optimized representations - my aim is for the case where you want/need to work with simple features source data in a lossless representation, i.e. more or less the same case as when directly using shapefiles.
 
2) 2 1/2 years and quite some projects later, I still prefer Flatbuffers over protobuf for a number of reasons, starting with direct memory access, easier client-side parser implementation (structs instead of strange parallel arrays and such), and performance. But it has one limitation with regard to memory usage and file size. Flatbuffers will require more memory (up to the size of the flatbuffers raw data) when you create or read a large flatbuffers file. This is due to the direct memory access capabilities. You can create a size-prefixed flatbuffers file (I've asked a question about this here [2]). But this will have negative side effects for your r-tree based approach: you'll need to read larger chunks (thus more unrelated data) to stream individual features. For streaming bbox queries a tile-based approach is still more efficient, and simpler. Data in one area is packed into one memory region, and can be queried using common tile pyramid techniques (or variations like ours).

I haven't dived deep enough into this yet, but my hope is that since I constrain my format to be non-updatable, I can construct and assume I have a balanced R-tree with a known size and structure. As such I should not need to read more of it than needed for traversal into the relevant part of the tree, and I can then use the offset index to get byte offsets to each feature out of a tree query result. Because I also require the features to be written in the order the R-tree has been built, they will be clustered on R-tree nodes, so that most queries should result in small sets of sequential reads to get the features.
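A minimal sketch of why the structure is known up front (illustrative only): for a bottom-up packed R-tree, the node count of every level follows directly from the feature count and the branching factor, so the byte layout of the whole index, and hence the position of any node, can be computed without reading it. For example, 1,000,000 features with node size 16 give levels of 1000000, 62500, 3907, 245, 16 and 1 nodes.

    #include <cstdint>
    #include <vector>

    // Node counts per level of a packed R-tree, leaves first, root last.
    std::vector<uint64_t> levelNodeCounts(uint64_t numItems, uint16_t nodeSize) {
        std::vector<uint64_t> counts{numItems};   // leaf level: one entry per feature
        uint64_t n = numItems;
        while (n > 1) {
            n = (n + nodeSize - 1) / nodeSize;    // ceil(n / nodeSize) parent nodes
            counts.push_back(n);
        }
        return counts;
    }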
 
3) The r-tree implementation might be sufficient for simple use cases. But when later requiring to deal with 3D data and large data sets (like OSM), the implementation will become more complex, and would require handling efficient disk-based indexing instead of having to read the whole index into memory (I think the memory footprint for the index of the OSM data set will be in the region of 1 GB++). In the end the advantage might not be so big compared to e.g. SQLite's R-Tree virtual table. It's actually a quite fast implementation, considering its low memory footprint (at the cost of disk accesses).

Agreed that I might have underestimated the complexity of a partial buffered tree traversal, but consider the above answer.
 
4) Seconding Even's suggestion about using WKB. Our implementation is also related to WKB types, but we defined the geometry types as part of the schema, similar to yours. It works fine, but today I'd choose a byte array for WKB data, for generic use and easier integration into existing software (no need to write a custom geometry parser for any language; WKB readers already exist which do that for you).

While I mostly agree I'm not entirely convinced, see my answer to Even.
 
5) In case the vector format is supposed for client-side rendering, how to handle zoom levels properly?

I view this as out of scope for the format.
 
From an adoptability and maintenance point of view, I'd try everything possible to avoid recreating too much. I think an "integrative" approach is a better choice, e.g. creating a GeoPackage extension for the vector tiles format analogous to WebP, using WKB as a byte array, and referring to the common tile pyramid scheme for easier implementation and integration possibilities for client-side rendering.

While that sounds like an interesting idea for vector tile distribution, which I agree is kind of poor today, it's not the aim of this format to compete with vector tiles, as I've hopefully been able to clarify now. :)
 
Some words about our "Heidelberg Sinusoidal Grid" (HSG) implementation. We basically faced three problems with existing approaches for our use cases:
- We need different projections for accurate indoor (and outdoor) models. Existing vector tile approaches are either optimized for rendering (e.g. projection-dependent and clipping geometry data at tile bounds, like MVT), or too proprietary / not widely adopted and thus more complex to integrate with common rendering approaches (like Google Earth's format, I've forgotten its name), thereby limiting their use in existing processing chains and editing tools.
- We make heavy use of feature querying and interaction, where the full feature is required instead of drawing commands (as with MVT).
- (New) Mixing the common xyz tiling scheme with "real feature" (not rendering-optimized, not clipped) vector data will eventually allow for efficient streaming (bbox) and editing capabilities. E.g. I'm thinking of using our vector tiles and simplifying / stripping non-useful data at lower zoom levels, and using non-simplified data at upper zoom levels where it can be edited in-place, for example in QGIS.

In brief this is how it works:
To be independent of a projection, and to allow for storing the whole globe, we use a variation of the equal area MODIS grid for the tile pyramid. So instead of using the actual projection the client(s) use a helper routine to calculate the "directly" required HSG tiles. Similar to how a web map would calculate the web Mercator tiles required for the current view.

There are however two key differences: The returned tiles (we call them actually cells for a reason) only loosely relate to the actually used projection: You will be provided the HSG cells (+ small buffer) covering your current client view with complete and unaltered features and geometry data regardless of what projection the geometries are actually in. So the grid is actually used for bbox (tile-pyramid alike) indexing and querying, and does not care about the projection or data stored inside individual cells (the projection is considered once when the storage is created and features get assigned to a cell).

The other difference to common tiling is that you will also need to load implicitly required cells for features overlapping a cell in current direct sight: cells contain meta info about related cells (cells containing at least one feature overlapping this cell). Constructing the storage is therefore a bit more complex, but we can thereby avoid the need for a client-side index like an r-tree.

I'm willing to open source that, but there hasn't been much interest in this. I'm interested in participating in defining a next-gen vector file format as well.

My spontaneous reaction to "HSG" is that it seems rather complex and perhaps that it is aiming at other use case(s) than I had in mind.
 

Cheers
Ben

[1] https://github.com/geometalab/Vector-Tiles-Reader-QGIS-Plugin
[2] https://stackoverflow.com/questions/35322700/flatbuffers-how-to-write-giant-files



Re: FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell
After posting about my experimental format I realized that I lacked numbers on the potential performance, so I tried to make some more or less scientific measurements. The results were disappointing, reaching similar performance to shapefile for full sequential reads, and so I lost interest for a while. But recently I found out that my method of measuring was flawed - I measured the time to read into a new shapefile using ogr2ogr, but it turns out that can be quite unfair due to the dynamic resizing of string field lengths that ogr2ogr will do if the strings from the input feature attributes are of variable length. I've now made some new measurements using ogrinfo and the undocumented flag -qq to get a full sequential read.

I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to single polygons) and "gis_osm_roads_free_1" from Denmark at http://download.geofabrik.de/europe.html as the source, converted them to the respective formats, and then measured the time it takes to run ogrinfo -qq on them. Results (average over three runs):

## gis_osm_buildings_a_free_1 (2027890 Polygons)
* SHP: 3800 ms
* GPKG: 2700 ms
* FlatGeobuf: 2200 ms

## gis_osm_roads_free_1 (812547 LineStrings)
* SHP: 1600 ms
* GPKG: 1350 ms
* FlatGeobuf: 800 ms

With that my hope and interest is rekindled. I believe I can fare even better against the competition with spatially filtered searches, and will hopefully soon have some results on that too.

/Björn


Re: FlatGeobuf; proposal for a new performance oriented vector file format

Björn Harrtell
I've now implemented spatial filtering (in the GDAL fork at https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and the results are promising: for two test cases, about 3.5-4 times faster than shapefile with the same content. I've used these command lines to measure the total time over 10 iterations for each format and dataset:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
real 0m1,940s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
real 0m0,750s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real 0m0,461s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single; done
real 0m4,381s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single; done
real 0m1,624s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; done
real 0m1,113s

The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total features for gis_osm_roads_free_1 and 57699 out of 2027890 features for gis_osm_buildings_a_free_1_single.

And there is definitely room for optimization. The implementation is naive; it reads the whole R-tree into memory and does not yet read consecutive features (a common case, as the R-tree is balanced) in larger I/O blocks, which I think can be significant. If I reduce the extent to 10 56 10.1 56.1, GPKG is actually a fair bit faster, which I think is due to it being able to read the index partially.
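A minimal sketch of the larger-I/O-block idea (illustrative only, not the current driver code): because clustered features tend to produce adjacent byte ranges, ranges that touch or lie within a small gap of each other can be coalesced into a single larger read before issuing the I/O.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct ByteRange { uint64_t begin, end; };  // [begin, end)

    // Merge ranges whose gap is at most maxGap bytes, so a query hitting many
    // consecutive features turns into a few larger reads instead of many small ones.
    std::vector<ByteRange> coalesce(std::vector<ByteRange> ranges, uint64_t maxGap) {
        if (ranges.empty()) return ranges;
        std::sort(ranges.begin(), ranges.end(),
                  [](const ByteRange& a, const ByteRange& b) { return a.begin < b.begin; });
        std::vector<ByteRange> merged{ranges.front()};
        for (std::size_t i = 1; i < ranges.size(); ++i) {
            if (ranges[i].begin <= merged.back().end + maxGap)
                merged.back().end = std::max(merged.back().end, ranges[i].end);
            else
                merged.push_back(ranges[i]);
        }
        return merged;
    }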

/Björn
