Btree operators

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Btree operators

Darafei "Komяpa" Praliaskouski
hey there,

https://trac.osgeo.org/postgis/ticket/3521 floated nearby me and broke my heart.

Can we have SELECT DISTINCT work for sure as a complete compare, not just bbox one?

It also touches the topic of sorting the boxes - it seems now they're sorted by X, then by Y - which is usually a poor way to preserve spatial locality. Can we sort by z-order curve instead? just bit-mix all the coordinates: https://en.wikipedia.org/wiki/Z-order_curve - that will make painful voodoo stuff like ORDER BY ST_GeoHash(ST_Transfrom(ST_Centroid(geom),4326)) much cleaner and replace it with just CLUSTER (see https://github.com/openstreetmap/osm2pgsql/issues/208 ).

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Daniel Baston
I've seen plenty of users (including myself) confused by the = operator, especially when it's implicitly accessed in cases like DISTINCT, DISTINCT ON(), and GROUP BY.  The workaround, to serialize and do a string compare, can really add a lot of noise to a query.  Are there benefits that come from the current behavior of the = operator, besides the (high) value of backwards compatibility?

Dan

On Thu, Apr 7, 2016 at 7:22 AM, Komяpa <[hidden email]> wrote:
hey there,

https://trac.osgeo.org/postgis/ticket/3521 floated nearby me and broke my heart.

Can we have SELECT DISTINCT work for sure as a complete compare, not just bbox one?

It also touches the topic of sorting the boxes - it seems now they're sorted by X, then by Y - which is usually a poor way to preserve spatial locality. Can we sort by z-order curve instead? just bit-mix all the coordinates: https://en.wikipedia.org/wiki/Z-order_curve - that will make painful voodoo stuff like ORDER BY ST_GeoHash(ST_Transfrom(ST_Centroid(geom),4326)) much cleaner and replace it with just CLUSTER (see https://github.com/openstreetmap/osm2pgsql/issues/208 ).

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel


_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Paul Ramsey
Speed is a benefit. Once you start down the road of = semantics, the
question of "which equals" looms large. Spatial equality?
Representational equality? Equality within a tolerance? Doing a string
compare is basically the only one that doesn't impose a surprising
cost, and it is, as you say, noisy.

WRT ordering, I did actually briefly have a z-ordering approach in
place, but I cannot remember why I had to drop it: but I did. Shame
about my memory on that.

P

On Thu, Apr 7, 2016 at 6:22 AM, Daniel Baston <[hidden email]> wrote:

> I've seen plenty of users (including myself) confused by the = operator,
> especially when it's implicitly accessed in cases like DISTINCT, DISTINCT
> ON(), and GROUP BY.  The workaround, to serialize and do a string compare,
> can really add a lot of noise to a query.  Are there benefits that come from
> the current behavior of the = operator, besides the (high) value of
> backwards compatibility?
>
> Dan
>
> On Thu, Apr 7, 2016 at 7:22 AM, Komяpa <[hidden email]> wrote:
>>
>> hey there,
>>
>> https://trac.osgeo.org/postgis/ticket/3521 floated nearby me and broke my
>> heart.
>>
>> Can we have SELECT DISTINCT work for sure as a complete compare, not just
>> bbox one?
>>
>> It also touches the topic of sorting the boxes - it seems now they're
>> sorted by X, then by Y - which is usually a poor way to preserve spatial
>> locality. Can we sort by z-order curve instead? just bit-mix all the
>> coordinates: https://en.wikipedia.org/wiki/Z-order_curve - that will make
>> painful voodoo stuff like ORDER BY
>> ST_GeoHash(ST_Transfrom(ST_Centroid(geom),4326)) much cleaner and replace it
>> with just CLUSTER (see https://github.com/openstreetmap/osm2pgsql/issues/208
>> ).
>>
>> _______________________________________________
>> postgis-devel mailing list
>> [hidden email]
>> http://lists.osgeo.org/mailman/listinfo/postgis-devel
>
>
>
> _______________________________________________
> postgis-devel mailing list
> [hidden email]
> http://lists.osgeo.org/mailman/listinfo/postgis-devel
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Darafei "Komяpa" Praliaskouski


чт, 7 апр. 2016 г. в 17:39, Paul Ramsey <[hidden email]>:
Speed is a benefit. Once you start down the road of = semantics, the
question of "which equals" looms large. Spatial equality?
Representational equality? Equality within a tolerance? Doing a string
compare is basically the only one that doesn't impose a surprising
cost, and it is, as you say, noisy.

The equality that is the same as string compare would be a lot less surprising.
It shouldn't be slower than bbox equality with recheck for the case when they're equal.
 
WRT ordering, I did actually briefly have a z-ordering approach in
place, but I cannot remember why I had to drop it: but I did. Shame
about my memory on that.

So, maybe it's time for the new attempt, with at least FAQ entry if everything breaks down? :)
 
P

On Thu, Apr 7, 2016 at 6:22 AM, Daniel Baston <[hidden email]> wrote:
> I've seen plenty of users (including myself) confused by the = operator,
> especially when it's implicitly accessed in cases like DISTINCT, DISTINCT
> ON(), and GROUP BY.  The workaround, to serialize and do a string compare,
> can really add a lot of noise to a query.  Are there benefits that come from
> the current behavior of the = operator, besides the (high) value of
> backwards compatibility?
>
> Dan
>
> On Thu, Apr 7, 2016 at 7:22 AM, Komяpa <[hidden email]> wrote:
>>
>> hey there,
>>
>> https://trac.osgeo.org/postgis/ticket/3521 floated nearby me and broke my
>> heart.
>>
>> Can we have SELECT DISTINCT work for sure as a complete compare, not just
>> bbox one?
>>
>> It also touches the topic of sorting the boxes - it seems now they're
>> sorted by X, then by Y - which is usually a poor way to preserve spatial
>> locality. Can we sort by z-order curve instead? just bit-mix all the
>> coordinates: https://en.wikipedia.org/wiki/Z-order_curve - that will make
>> painful voodoo stuff like ORDER BY
>> ST_GeoHash(ST_Transfrom(ST_Centroid(geom),4326)) much cleaner and replace it
>> with just CLUSTER (see https://github.com/openstreetmap/osm2pgsql/issues/208
>> ).
>>
>> _______________________________________________
>> postgis-devel mailing list
>> [hidden email]
>> http://lists.osgeo.org/mailman/listinfo/postgis-devel
>
>
>
> _______________________________________________
> postgis-devel mailing list
> [hidden email]
> http://lists.osgeo.org/mailman/listinfo/postgis-devel
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Sandro Santilli-2
On Thu, Apr 07, 2016 at 03:06:56PM +0000, Komяpa wrote:

> чт, 7 апр. 2016 г. в 17:39, Paul Ramsey <[hidden email]>:
>
> > Speed is a benefit. Once you start down the road of = semantics, the
> > question of "which equals" looms large. Spatial equality?
> > Representational equality? Equality within a tolerance? Doing a string
> > compare is basically the only one that doesn't impose a surprising
> > cost, and it is, as you say, noisy.
>
> The equality that is the same as string compare would be a lot less
> surprising.
> It shouldn't be slower than bbox equality with recheck for the case when
> they're equal.

We don't even need to convert to string, actually.
A memcmp call would be pretty fast.

--strk;
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Rémi Cura
Under the hood bounding box equality seems like a very counter intuitive behaviour.

On the other hand, adding a "::text" or "::bytea"in the "distinct on" part does not feel like to much of a burden,
although it is mighty ugly.

Having a = operator with tolerance would be super useful, but I don't see how you could use it in a distinct query.

Cheers,
Rémi-C

2016-04-07 17:13 GMT+02:00 Sandro Santilli <[hidden email]>:
On Thu, Apr 07, 2016 at 03:06:56PM +0000, Komяpa wrote:
> чт, 7 апр. 2016 г. в 17:39, Paul Ramsey <[hidden email]>:
>
> > Speed is a benefit. Once you start down the road of = semantics, the
> > question of "which equals" looms large. Spatial equality?
> > Representational equality? Equality within a tolerance? Doing a string
> > compare is basically the only one that doesn't impose a surprising
> > cost, and it is, as you say, noisy.
>
> The equality that is the same as string compare would be a lot less
> surprising.
> It shouldn't be slower than bbox equality with recheck for the case when
> they're equal.

We don't even need to convert to string, actually.
A memcmp call would be pretty fast.

--strk;
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel


_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Darafei "Komяpa" Praliaskouski
Having counter-intuitive behaviour leads to realy sublte bugs that are nearly impossible to debug.
SQL was meant as human-readable, things like this one really break the perceprion of it as of English text and break expectations.

Distinct is (meant? felt?) to deduplicate identical records, that appeared possibly because of some join elsewhere in the flow (or for any other reason). For floating point numbers there, I don't really expect "nearly equals" behaviour, so ::bytea/memcmp-like comparsion for geometries seems sane in the case. 

When I do SELECT DISTINCT geom, I want _distinct geometry_, not _geometry with distinct boxes_. For distinct boxes I'd write SELECT DISTINCT on (ST_Envelope(geom)) geom, and that's rather rare case. These two being swapped really require mind-twisting, and the more mind-twisting it requires the less people can use it.

How about sorting by zig-zag-encoded coordinates? 

Are there any showstoppers to implement this change, except of everyone has to REINDEX?

чт, 7 апр. 2016 г. в 20:01, Rémi Cura <[hidden email]>:
Under the hood bounding box equality seems like a very counter intuitive behaviour.

On the other hand, adding a "::text" or "::bytea"in the "distinct on" part does not feel like to much of a burden,
although it is mighty ugly.

Having a = operator with tolerance would be super useful, but I don't see how you could use it in a distinct query.

Cheers,
Rémi-C

2016-04-07 17:13 GMT+02:00 Sandro Santilli <[hidden email]>:
On Thu, Apr 07, 2016 at 03:06:56PM +0000, Komяpa wrote:
> чт, 7 апр. 2016 г. в 17:39, Paul Ramsey <[hidden email]>:
>
> > Speed is a benefit. Once you start down the road of = semantics, the
> > question of "which equals" looms large. Spatial equality?
> > Representational equality? Equality within a tolerance? Doing a string
> > compare is basically the only one that doesn't impose a surprising
> > cost, and it is, as you say, noisy.
>
> The equality that is the same as string compare would be a lot less
> surprising.
> It shouldn't be slower than bbox equality with recheck for the case when
> they're equal.

We don't even need to convert to string, actually.
A memcmp call would be pretty fast.

--strk;
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Sandro Santilli-2
On Mon, Apr 11, 2016 at 11:06:40AM +0000, Komяpa wrote:

> When I do SELECT DISTINCT geom, I want _distinct geometry_, not _geometry
> with distinct boxes_. For distinct boxes I'd write SELECT DISTINCT on
> (ST_Envelope(geom)) geom, and that's rather rare case. These two being
> swapped really require mind-twisting, and the more mind-twisting it
> requires the less people can use it.
>
> How about sorting by zig-zag-encoded coordinates?

If equality has to be memcmp-like, comparision need to be memcmp-like too
as I think it's a requirement for A=B to mean neither A < B  nor B < A.

> Are there any showstoppers to implement this change, except of everyone has
> to REINDEX?

It's still missing a plan, as far as I can tell.

--strk;
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Darafei "Komяpa" Praliaskouski


пн, 11 апр. 2016 г. в 16:38, Sandro Santilli <[hidden email]>:
On Mon, Apr 11, 2016 at 11:06:40AM +0000, Komяpa wrote:

> When I do SELECT DISTINCT geom, I want _distinct geometry_, not _geometry
> with distinct boxes_. For distinct boxes I'd write SELECT DISTINCT on
> (ST_Envelope(geom)) geom, and that's rather rare case. These two being
> swapped really require mind-twisting, and the more mind-twisting it
> requires the less people can use it.
>
> How about sorting by zig-zag-encoded coordinates?

If equality has to be memcmp-like, comparision need to be memcmp-like too
as I think it's a requirement for A=B to mean neither A < B  nor B < A.

This can be worked around by checking the memcmp-like ordering for case when the zig-zag-encoded coordinates are the same.
 
> Are there any showstoppers to implement this change, except of everyone has
> to REINDEX?

It's still missing a plan, as far as I can tell.

What's required for such a plan? 

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Mark Cave-Ayland
In reply to this post by Darafei "Komяpa" Praliaskouski
On 11/04/16 12:06, Komяpa wrote:

> Having counter-intuitive behaviour leads to realy sublte bugs that are
> nearly impossible to debug.
> SQL was meant as human-readable, things like this one really break the
> perceprion of it as of English text and break expectations.
>
> Distinct is (meant? felt?) to deduplicate identical records, that
> appeared possibly because of some join elsewhere in the flow (or for any
> other reason). For floating point numbers there, I don't really expect
> "nearly equals" behaviour, so ::bytea/memcmp-like comparsion for
> geometries seems sane in the case.
>
> When I do SELECT DISTINCT geom, I want _distinct geometry_, not
> _geometry with distinct boxes_. For distinct boxes I'd write SELECT
> DISTINCT on (ST_Envelope(geom)) geom, and that's rather rare case. These
> two being swapped really require mind-twisting, and the more
> mind-twisting it requires the less people can use it.
>
> How about sorting by zig-zag-encoded coordinates?
>
> Are there any showstoppers to implement this change, except of everyone
> has to REINDEX?

Bear in mind that DISTINCT isn't quite as simple as you may imagine -
for example would you consider two geometries with the same coordinates
but in reverse order the same? Or how about two geometries that are
exactly the same but one with a 3rd dimension coordinate added to some
vertices?

Also note that memcmp() can't be directly used on floating point numbers
since it is possible to have multiple binary representations of the same
number (a quick search for memcmp floating point numbers will point you
towards some of the issues).


HTH,

Mark.

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Darafei "Komяpa" Praliaskouski


пн, 11 апр. 2016 г. в 17:36, Mark Cave-Ayland <[hidden email]>:
On 11/04/16 12:06, Komяpa wrote:

> Having counter-intuitive behaviour leads to realy sublte bugs that are
> nearly impossible to debug.
> SQL was meant as human-readable, things like this one really break the
> perceprion of it as of English text and break expectations.
>
> Distinct is (meant? felt?) to deduplicate identical records, that
> appeared possibly because of some join elsewhere in the flow (or for any
> other reason). For floating point numbers there, I don't really expect
> "nearly equals" behaviour, so ::bytea/memcmp-like comparsion for
> geometries seems sane in the case.
>
> When I do SELECT DISTINCT geom, I want _distinct geometry_, not
> _geometry with distinct boxes_. For distinct boxes I'd write SELECT
> DISTINCT on (ST_Envelope(geom)) geom, and that's rather rare case. These
> two being swapped really require mind-twisting, and the more
> mind-twisting it requires the less people can use it.
>
> How about sorting by zig-zag-encoded coordinates?
>
> Are there any showstoppers to implement this change, except of everyone
> has to REINDEX?

Bear in mind that DISTINCT isn't quite as simple as you may imagine -
for example would you consider two geometries with the same coordinates
but in reverse order the same?
 
They're different.
 
Or how about two geometries that are
exactly the same but one with a 3rd dimension coordinate added to some
vertices?
 
They're different. 

Also note that memcmp() can't be directly used on floating point numbers
since it is possible to have multiple binary representations of the same
number (a quick search for memcmp floating point numbers will point you
towards some of the issues).
 
That's fine.

It's no different form currently recommended ::bytea or ::text, but keeps more readability in the SQL code.

If someone needs more sophisticated filtering, they can always do the filtering that fits them better, like ST_SnapToGrid or ST_CollectionHomogenize or ST_ForceRHR or ST_MakeValid or  ST_Area(ST_Intersection(geom1, geom2))/(ST_Area(geom1)+ST_Area(geom2)) - whatever they like the best.

Current problem is that in current approach DISTINCT kills too much values that differ from one another.

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Sandro Santilli-2
In reply to this post by Darafei "Komяpa" Praliaskouski
On Mon, Apr 11, 2016 at 02:27:03PM +0000, Komяpa wrote:
> > On Mon, Apr 11, 2016 at 11:06:40AM +0000, Komяpa wrote:
>
> > > Are there any showstoppers to implement this change,
> > > except of everyone has to REINDEX?
> >
> > It's still missing a plan, as far as I can tell.
>
> What's required for such a plan?

How about a short RFC ?
See https://trac.osgeo.org/postgis/wiki/DevWikiRFC

--strk;
_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Btree operators

Mark Cave-Ayland
In reply to this post by Darafei "Komяpa" Praliaskouski
On 11/04/16 15:47, Komяpa wrote:

> пн, 11 апр. 2016 г. в 17:36, Mark Cave-Ayland
> <[hidden email] <mailto:[hidden email]>>:
>
>     On 11/04/16 12:06, Komяpa wrote:
>
>     > Having counter-intuitive behaviour leads to realy sublte bugs that are
>     > nearly impossible to debug.
>     > SQL was meant as human-readable, things like this one really break the
>     > perceprion of it as of English text and break expectations.
>     >
>     > Distinct is (meant? felt?) to deduplicate identical records, that
>     > appeared possibly because of some join elsewhere in the flow (or
>     for any
>     > other reason). For floating point numbers there, I don't really expect
>     > "nearly equals" behaviour, so ::bytea/memcmp-like comparsion for
>     > geometries seems sane in the case.
>     >
>     > When I do SELECT DISTINCT geom, I want _distinct geometry_, not
>     > _geometry with distinct boxes_. For distinct boxes I'd write SELECT
>     > DISTINCT on (ST_Envelope(geom)) geom, and that's rather rare case.
>     These
>     > two being swapped really require mind-twisting, and the more
>     > mind-twisting it requires the less people can use it.
>     >
>     > How about sorting by zig-zag-encoded coordinates?
>     >
>     > Are there any showstoppers to implement this change, except of
>     everyone
>     > has to REINDEX?
>
>     Bear in mind that DISTINCT isn't quite as simple as you may imagine -
>     for example would you consider two geometries with the same coordinates
>     but in reverse order the same?
>
>  
> They're different.
>  
>
>     Or how about two geometries that are
>     exactly the same but one with a 3rd dimension coordinate added to some
>     vertices?
>
>  
> They're different.
>
>     Also note that memcmp() can't be directly used on floating point numbers
>     since it is possible to have multiple binary representations of the same
>     number (a quick search for memcmp floating point numbers will point you
>     towards some of the issues).
>
>  
> That's fine.
>
> It's no different form currently recommended ::bytea or ::text, but
> keeps more readability in the SQL code.

If you're thinking about binary representation, sure, but as a
counter-example some people may already rely on this behaviour, perhaps
unwittingly, as part of an existing merge process. Not that this would
give a consistent result, but it's a change in user-visible behaviour.

> If someone needs more sophisticated filtering, they can always do the
> filtering that fits them better, like ST_SnapToGrid or
> ST_CollectionHomogenize or ST_ForceRHR or ST_MakeValid or
>  ST_Area(ST_Intersection(geom1, geom2))/(ST_Area(geom1)+ST_Area(geom2))
> - whatever they like the best.
>
> Current problem is that in current approach DISTINCT kills too much
> values that differ from one another.

Also if you're changing the geometry equals function then this also
affects the use of index scans in GROUP BY and DISTINCT (B-Tree
operators are used by PostgreSQL for sorts in these cases). You've also
got to think about whether any change in behaviour is intuitive and
consistent across all variants of R-Tree/B-Tree/heap scans and ST_Equals
- some functions will use = to cut the bounding box first as an index
optimisation before applying the more correct ST_Equals. There is also
the issue whereby if you're now pulling complete geometries vs. bounding
boxes then some index-only optimisations will now be unavailable to the
planner so there can be performance regressions on large datasets and
those with fewer but "wide" geometries.

Now there are probably ways to work around some of these issues but it
will require some thought. I suspect the main problem is that people
will have applications that depend upon the existing behaviour and so
such a change would need careful management across a major release.


ATB,

Mark.

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel