getrelid && list_nth

getrelid && list_nth

Sandro Santilli-2
Hello Mark,
I was writing this to pgsql-hackers, but I've solved it
so I turned it into a report.

While working on the PostGIS selectivity function for type 'geometry'
I've run into the following problem:

The getrelid() macro defined in parser/parsetree.h invokes
the rt_fetch() macro defined in the same header, which in turn invokes
list_nth().

Now list_nth() is not a macro, and PostGIS
is not linked against the PostgreSQL dynamic library, so
the module fails to load.
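
Roughly, this is the macro chain involved (paraphrased from the 7.5dev
parser/parsetree.h; the actual header is authoritative):

    /* parser/parsetree.h (approximate) */
    #define rt_fetch(rangetable_index, rangetable) \
            ((RangeTblEntry *) list_nth(rangetable, (rangetable_index) - 1))

    #define getrelid(rangeindex, rangetable) \
            (rt_fetch(rangeindex, rangetable)->relid)

    /*
     * list_nth() is a plain backend function (nodes/list.c), not a macro,
     * so expanding getrelid() in a loadable module leaves an external
     * reference to list_nth that must be resolvable when the module loads.
     */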

This worked until some months ago (PG was already
versioned as 7.5dev).

To fix this I've copied list_nth and list_nth_cell into postgis_estimate.c,
removing superfluous (I hope) boundary checks.
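
A local copy along these lines would do (sketch only; names are
illustrative, not necessarily what ended up in postgis_estimate.c;
ListCell/list_head/lnext/lfirst come from nodes/pg_list.h):

    /* hypothetical local helpers (names illustrative) */
    static ListCell *
    local_list_nth_cell(const List *list, int n)
    {
        ListCell *cell = list_head(list);

        while (n-- > 0)
            cell = lnext(cell);
        return cell;
    }

    static void *
    local_list_nth(const List *list, int n)
    {
        return lfirst(local_list_nth_cell(list, n));
    }

    /* then: relid = ((RangeTblEntry *) local_list_nth(rtable, varno - 1))->relid; */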

Selectivity estimation is now back up and running for PG 7.5.

Still... any stable utility interface would be appreciated.

--strk;

RE: getrelid && list_nth

Mark Cave-Ayland-2
Hi strk,

> -----Original Message-----
> From: strk [mailto:[hidden email]]
> Sent: 08 June 2004 18:19
> To: Mark Cave-Ayland
> Cc: [hidden email]
> Subject: getrelid && list_nth
>
>
> Hello Mark,
> I was writing this to pgsql-hackers, but I've solved it
> so I turned it into a report.
>
> Working on postgis selectivity function for type 'geometry'
> I've incurred in the following problem:
>
> The getrelid() macro defined in parser/parsetree.h invokes
> the rt_fetch() macro therein defined, which in turn invokes list_nth.
>
> Now list_nth() is not a macro, and postgis
> is not linked to postgresql dynamic library, so
> it fails to load.
>
> This worked until some months ago (PG was already
> versioned as 75dev).

FYI the List API has recently changed in PostgreSQL 7.5 which is
probably why this has broken - please see the following links:

http://archives.postgresql.org/pgsql-patches/2004-05/msg00344.php
http://archives.postgresql.org/pgsql-patches/2004-05/msg00376.php

and the thread here:

http://archives.postgresql.org/pgsql-patches/2004-05/msg00500.php


> To fix this I've copied list_nth and list_nth_cell into
> postgis_estimate.c removing superflous ( I hope ) checks on
> boundaries.

Yuck. I don't think we should be duplicating the PostgreSQL list API
just because the API has changed between versions. Any chance you could
look at the posted patch and correct CVS so that this is not needed?

> Now estimation selectivity is back up and runnign for PG75
>
> Still... any stable utility interface would be appreciated.

BTW, did you find that building large GiST indices crashes PostgreSQL?
Since the GiST API has changed, you will probably need to apply my "super
patch" to get things working correctly, as the problem wasn't just a
Win32 issue (see
http://postgis.refractions.net/pipermail/postgis-users/2004-May/004918.html)


Would any developers object if the "super patch" for 7.5 were to be
applied to CVS? If no-one objects I'd like to apply it real soon....



Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


Re: getrelid && list_nth

Sandro Santilli-2
On Wed, Jun 09, 2004 at 09:39:26AM +0100, Mark Cave-Ayland wrote:

> Hi strk,
>
> > -----Original Message-----
> > From: strk [mailto:[hidden email]]
> > Sent: 08 June 2004 18:19
> > To: Mark Cave-Ayland
> > Cc: [hidden email]
> > Subject: getrelid && list_nth
> >
> >
> > Hello Mark,
> > I was writing this to pgsql-hackers, but I've solved it
> > so I turned it into a report.
> >
> > Working on postgis selectivity function for type 'geometry'
> > I've incurred in the following problem:
> >
> > The getrelid() macro defined in parser/parsetree.h invokes
> > the rt_fetch() macro therein defined, which in turn invokes list_nth.
> >
> > Now list_nth() is not a macro, and postgis
> > is not linked to postgresql dynamic library, so
> > it fails to load.
> >
> > This worked until some months ago (PG was already
> > versioned as 75dev).
>
> FYI the List API has recently changed in PostgreSQL 7.5 which is
> probably why this has broken - please see the following links:
>
> http://archives.postgresql.org/pgsql-patches/2004-05/msg00344.php
> http://archives.postgresql.org/pgsql-patches/2004-05/msg00376.php
>
> and the thread here:
>
> http://archives.postgresql.org/pgsql-patches/2004-05/msg00500.php
>
>
> > To fix this I've copied list_nth and list_nth_cell into
> > postgis_estimate.c removing superflous ( I hope ) checks on
> > boundaries.
>
> Yuck. I don't think we should be duplicating the PostgreSQL list API
> just because the API has changed between versions. Any chance you could
> look at the posted patch and correct CVS so that this is not needed?

Do you mean the PGSQL List API change patch?

>
> > Now estimation selectivity is back up and runnign for PG75
> >
> > Still... any stable utility interface would be appreciated.
>
> BTW did you find that building large GiST indices crashes PostgreSQL?

Nope. I haven't been playing with PG75 for a while.

> Since the GiST API has changed you will probably need to apply my "super
> patch" in order to get things working correctly as the problem wasn't
> just a Win32 issue (see
> http://postgis.refractions.net/pipermail/postgis-users/2004-May/004918.h
> tml)
>
> Would any developers object if the "super patch" for 7.5 were to be
> applied to CVS? If no-one objects I'd like to apply it real soon....

I'll take a look at it.

--strk;

RE: getrelid && list_nth

Mark Cave-Ayland-2
In reply to this post by Sandro Santilli-2
Hi strk,

> -----Original Message-----
> From: 'strk' [mailto:[hidden email]]
> Sent: 09 June 2004 09:57
> To: Mark Cave-Ayland
> Cc: [hidden email]
> Subject: Re: getrelid && list_nth
>
>
> On Wed, Jun 09, 2004 at 09:39:26AM +0100, Mark Cave-Ayland wrote:
> > Hi strk,
> >
> > > -----Original Message-----
> > > From: strk [mailto:[hidden email]]
> > > Sent: 08 June 2004 18:19
> > > To: Mark Cave-Ayland
> > > Cc: [hidden email]
> > > Subject: getrelid && list_nth
> > >
> > >
> > > Hello Mark,
> > > I was writing this to pgsql-hackers, but I've solved it
> > > so I turned it into a report.
> > >
> > > Working on postgis selectivity function for type 'geometry'
> > > I've incurred in the following problem:
> > >
> > > The getrelid() macro defined in parser/parsetree.h invokes the
> > > rt_fetch() macro therein defined, which in turn invokes list_nth.
> > >
> > > Now list_nth() is not a macro, and postgis
> > > is not linked to postgresql dynamic library, so
> > > it fails to load.
> > >
> > > This worked until some months ago (PG was already
> > > versioned as 75dev).
> >
> > FYI the List API has recently changed in PostgreSQL 7.5 which is
> > probably why this has broken - please see the following links:
> >
> > http://archives.postgresql.org/pgsql-patches/2004-05/msg00344.php
> > http://archives.postgresql.org/pgsql-patches/2004-05/msg00376.php
> >
> > and the thread here:
> >
> > http://archives.postgresql.org/pgsql-patches/2004-05/msg00500.php
> >
> >
> > > To fix this I've copied list_nth and list_nth_cell into
> > > postgis_estimate.c removing superflous ( I hope ) checks on
> > > boundaries.
> >
> > Yuck. I don't think we should be duplicating the PostgreSQL
> list API
> > just because the API has changed between versions. Any chance you
> > could look at the posted patch and correct CVS so that this is not
> > needed?
>
> Do you mean the PGSQL List API change patch ?

Yes - I'm guessing that Dave must have borrowed the code from an
existing selectivity function either in the PostgreSQL core (probably
selfuncs.c!) or contrib/, so hopefully the diff will let you see
which changes need to be made. My guess from perusing the links above
would be that it is something simple like a macro name having been
changed.
 
> > BTW did you find that building large GiST indices crashes
> PostgreSQL?
>
> Nope. I haven't been playing with PG75 for a while.

:) Me neither. The GiST API has changed again which is what resulted in
my "super patch" to stop PostgreSQL crashing when building geometric
GiST indices.

> > Would any developers object if the "super patch" for 7.5 were to be
> > applied to CVS? If no-one objects I'd like to apply it real soon....
>
> I'll take a look at it.

Many thanks. It's actually a very simple patch, but it also includes
Romi's fixes to get things to compile under Win32 MinGW too. With this
patch applied, and the list API change done, you should find that CVS
compiles under 7.5 once again.


Cheers,

Mark.

Re: getrelid && list_nth

Sandro Santilli-2
In reply to this post by Sandro Santilli-2
I've applied Romi's patches (slightly modified),
and added your postgis_gist75.c (adding infinite-geometry handling to it).
I don't like the Makefile handling... but we can fix that later.

As for the list_nth copy in postgis_estimate.c, I've removed it.
It seems to build and run anyway now.
Yet the estimator is far from correct... enabling debugging output
shows weird results, with NaN and Inf values coming out often...

--strk;


Re: getrelid && list_nth

Sandro Santilli-2
On Wed, Jun 09, 2004 at 12:08:02PM +0200, 'strk' wrote:
> I've applied Romi's patches (slightly modified).
> And added you postgis_gist75.c (adding infinite geometry handling to it).
> Don't like the handling in Makefile... but we can fix that later.
>
> As for the list_nth copy in postgis_estimate.c, I've removed it.
> It seems to build and run anyway now.
> Yet estimator work is far from correct... enabling debugging output
> shows weird results, with nan and inf values coming out often...

OK, I've found the reason... it's due to the fact that
my table contained only identical points, so the histogram
consisted of 40x40 infinitesimal cells...
We should probably handle such cases quickly, returning
0 when the data point does not overlap the search box and
1 when the data point is contained in the search box.

What do you think?

--strk;

RE: getrelid && list_nth

David Blasby-3
In reply to this post by Mark Cave-Ayland-2
Mark Cave-Ayland wrote:
> Would any developers object if the "super patch" for 7.5 were to be
> applied to CVS? If no-one objects I'd like to apply it real soon....


Is this for postgis_gist75.c or does it affect other stuff?  If it's just
the gist75 code, then go for it - we don't have a stable or tested 7.5
release yet.

dave

RE: getrelid && list_nth

Sandro Santilli-2
On Wed, Jun 09, 2004 at 09:39:10AM -0700, David Blasby wrote:
> Mark Cave-Ayland wrote:
> >Would any developers object if the "super patch" for 7.5 were to be
> >applied to CVS? If no-one objects I'd like to apply it real soon....
>
>
> Is this for postgis_gist75.c or does it affect other stuff?  If its just
> the gist75 code, then go for it - we dont have a stable or tested 7.5
> release yet.

I've already applied the patch. The GiST stuff affects PG 7.5 only.
The Win32 compile options are harmless (an extra include, plus use of
strchr/strrchr instead of index/rindex).

--strk;

Re: getrelid && list_nth

Mark Cave-Ayland-2
In reply to this post by Sandro Santilli-2
Hi strk,

> -----Original Message-----
> From: 'strk' [mailto:[hidden email]]
> Sent: 09 June 2004 11:46
> To: Mark Cave-Ayland; [hidden email]
> Subject: Re: [postgis-devel] Re: getrelid && list_nth
>
>
> Ok. I've found the reason... it is due to the fact that
> my table contained only equal points, so the histogram
> consisted in 40x40 infinitesimal cells... Whe should probably
> handle similar cases quickly returning 0 for data-point not
> contained overlap search-box and
> 1 for data-point contained in search-box
>
> what do you think ?
>
> --strk;

Ahhhh another corner case! :)

I think the simplest thing to do in this case would be to expand the
area of the histogram by a small amount, for example if
sample_extent->LLB.x == sample_extent->URT.x then:

geomstats->xmin = sample_extent->LLB.x - 1.0
geomstats->xmax = sample_extent->URT.x + 1.0

and similarly if sample_extent->LLB.y == sample_extent->URT.y then:

geomstats->ymin = sample_extent->LLB.y - 1.0
geomstats->ymax = sample_extent->URT.y + 1.0


The other corner case I have seen was when one of our import scripts
went wrong and we ended up with several very large x/y coords in our
table along with our normal data. When we did an ANALYZE, because the
overall extents were so large, all of the geometries appeared in a
single box in the bottom-left corner of the histogram and hence
produced really bad query plans - deleting these erroneous geometries and
doing ANALYZE solved the problem and everything went back to normal.

The solution I thought of was to try and reduce the effect of outliers
using some statistical theory: for an approximately normal distribution,
about 95% of the data lies within 2 standard deviations of the
mean. So I was considering calculating the mean of all LLB.x/y and
URT.x/y, using this to calculate the standard deviation, and hence
calculate the geomextents from this. This should hopefully reduce the
effect of excessively large coordinates on the extents of the
selectivity histogram, at the cost of making the histogram area slightly
smaller than that currently indicated by the overall sample extents.

The pseudo-code example for LLB.x would look something like:

        // Calculate mean
        for each LLB.x in sample
                sumLLB.x += LLB.x
        next
        meanLLB.x = sumLLB.x / samplerows

        // Calculate standard deviation
        for each LLB.x in sample
                s2 += (LLB.x - meanLLB.x) * (LLB.x - meanLLB.x)
        next
        s2 /= (samplerows - 1)
        s = sqrt(s2)

        // For 95% coverage we should lie within 2sds of the mean
        geomstats->xmin = meanLLB.x - 2 * s


The downside is that we have to iterate through the sample tuples twice,
once to calculate the mean and then once again to calculate the standard
deviation. However, since we are only using a sample of rows from the
table then hopefully the effect on larger tables will be negligible.
Comments anyone?


Cheers,

Mark.

Re: getrelid && list_nth

Sandro Santilli-2
On Thu, Jun 10, 2004 at 12:36:35PM +0100, Mark Cave-Ayland wrote:

> Hi strk,
>
> > -----Original Message-----
> > From: 'strk' [mailto:[hidden email]]
> > Sent: 09 June 2004 11:46
> > To: Mark Cave-Ayland; [hidden email]
> > Subject: Re: [postgis-devel] Re: getrelid && list_nth
> >
> >
> > Ok. I've found the reason... it is due to the fact that
> > my table contained only equal points, so the histogram
> > consisted in 40x40 infinitesimal cells... Whe should probably
> > handle similar cases quickly returning 0 for data-point not
> > contained overlap search-box and
> > 1 for data-point contained in search-box
> >
> > what do you think ?
> >
> > --strk;
>
> Ahhhh another corner case! :)
>
> I think the simplest thing to do in this case would be to expand the
> area of the histogram by a small amount, for example if
> sample_extent->LLB.x == sample_extent->URT.x then:
>
> geomstats->xmin = sample_extent->LLB.x - 1.0
> geomstats->xmax = sample_extent->URT.x + 1.0
>
> and similarly if if sample_extent->LLB.y == sample_extent->URT.y then:
>
> geomstats->ymin = sample_extent->LLB.y - 1.0
> geomstats->ymax = sample_extent->URT.y + 1.0

Makes sense, but it's a waste of time and resources to keep
building and analyzing a histogram when the whole sample
extent is reduced to a point. We could instead make a single-box
histogram with a value of 1...

> The other corner case I have seen was when one of our import scripts
> went wrong and we ended up with several very large x/y coords in our
> table along with our normal data. When we did an ANALYZE, because the
> overall extents were so large, all of the geometries appeared in a
> single box in the bottom left hand corner of histogram and hence
> provided really bad query plans - deleting these erroneus geometries and
> doing ANALYZE solved the problem and everything went back to normal.
>
> The solution I thought of was to try and reduce the effect of outliers
> using some statistical theory: for example we know that for a random
> sample of data, 95% of the data lies within 2 standard deviations of the
> mean. So I was considering calculating the mean of all LLB.x/y and
> URT.x/y, using this to calculate the standard deviation, and hence
> calculate the geomextents from this. This should hopefully reduce the
> effect of excessively large coordinates on the extents of the
> selectivity histogram at the cost of making the histogram area slightly
> smaller than that currently indicated by the overall sample extents.
>
> The pseudo-code example for LLB.x would look something like:
>
> // Calculate mean
> for each LLB.x in sample
> sumLLB.x += LLB.x
> next
> meanLLB.x = sumLLB.x / samplerows
>
> // Calculate standard deviation
> for each LLB.x in sample
> s2 = (LLB.x - meanLLB.x) * (LLB.x - meanLLB.x)
> next
> s2 /= (samplerows - 1)
> s = sqrt(s2)
>
> // For 95% coverage we should lie within 2sds of the mean
> geomstats->xmin = meanLLB.x - 2 * s
>
>
> The downside is that we have to iterate through the sample tuples twice,
> once to calculate the mean and then once again to calculate the standard
> deviation. However, since we are only using a sample of rows from the
> table then hopefully the effect on larger tables will be negligible.
> Comments anyone?

Interesting. Am I right in saying this computation would make the
extent MUCH smaller for your corner case?

We already make two scans BTW:
@1357@
        /*
         * First scan:
         *  o find extent of the sample rows
         *  o count null/not-null values
         *  o compute total_width
         *  o compute total features's box area (for avgFeatureArea)
         */
@1357@
        /*
         * Second scan:
         *  o fill histogram values with the number of
         *    features' bbox overlaps: a feature's bvol
         *    can fully overlap (1) or partially overlap
         *    (fraction of 1) an histogram cell.
         *
         *  o compute total cells occupation
         */
 
As you can see, the first scan could also compute the standard deviation
and handle the 0-extent case.

Thinking deeper about these cases, we should probably abandon BoxesPerSide
and use Columns / Rows instead, as a bunch of points lying on the
same horizontal line would require many columns and a single row...
We could calculate the width/height factor and use that to split
the geometry_stats_target*160 into Rows and Columns. We would end up
with always-near-square histogram cells, but I don't see a big problem
with that. Moreover, we could set a minimum histogram-cell size which
would in turn automate the extent enlargement you were suggesting.
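
Something like this, roughly (a sketch only; total_cells stands for the
geometry_stats_target*160 budget, and the names are illustrative):

    #include <math.h>

    /*
     * Sketch: split a fixed cell budget into near-square columns/rows
     * from the extent's aspect ratio (cols*rows may slightly exceed
     * total_cells because of the ceil; fine for a sketch).
     */
    static void
    histo_grid_shape(double width, double height, int total_cells,
                     int *cols, int *rows)
    {
        double cell_edge;

        if (width <= 0 && height <= 0)      /* single point */
        {
            *cols = *rows = 1;
            return;
        }
        if (width <= 0)                     /* vertical line of points */
        {
            *cols = 1;
            *rows = total_cells;
            return;
        }
        if (height <= 0)                    /* horizontal line of points */
        {
            *rows = 1;
            *cols = total_cells;
            return;
        }

        /* side of a square cell tiling the extent with ~total_cells cells */
        cell_edge = sqrt((width * height) / (double) total_cells);
        *cols = (int) ceil(width / cell_edge);
        *rows = (int) ceil(height / cell_edge);
    }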

--strk;

Re: getrelid && list_nth

Mark Cave-Ayland-2
In reply to this post by Sandro Santilli-2
Hi strk,

> -----Original Message-----
> From: 'strk' [mailto:[hidden email]]
> Sent: 10 June 2004 13:24
> To: Mark Cave-Ayland
> Cc: [hidden email]
> Subject: Re: [postgis-devel] Re: getrelid && list_nth

(lots more cut)

> Interesting. Am I right saying this computation would make an
> extent MUCH smaller for your corner case ?

Well, IANAM (I am not a mathematician), but the calculation *should*
treat values further away from the mean as more likely erroneous, assuming
a normal distribution. I really need to dig out some of those old maths
textbooks ;) The test case is easy: load in a real data set, then add
a couple of really large points like POINT(1.0E20 1.0E20) and do an
ANALYZE; this will give a really unbalanced histogram which leads to really
bad query plans.

It would be useful to get some data and play in a spreadsheet but I
haven't got time to do that right now...... The other option would be to
simply filter out any data above the 2 SD threshold and not use it when
calculating the extents - but then we will lose a small percentage of
good data.... although we can increase the limit to 2.5 SDs to make sure
we lose as little as possible. Methinks some experimentation is required
:)

> We already make two scans BTW:
> @1357@
>         /*
>          * First scan:
>          *  o find extent of the sample rows
>          *  o count null/not-null values
>          *  o compute total_width
>          *  o compute total features's box area (for avgFeatureArea)
>          */
> @1357@
>         /*
>          * Second scan:
>          *  o fill histogram values with the number of
>          *    features' bbox overlaps: a feature's bvol
>          *    can fully overlap (1) or partially overlap
>          *    (fraction of 1) an histogram cell.
>          *
>          *  o compute total cells occupation
>          */
>  
> As you can see the first scan could also compute standard
> deviation and handle 0-extent case.

I'm not sure we can do this.... we need the mean first so that's one
iteration to find the mean before we can calculate the SD (second
iteration) .... and we need the SD to calculate the histogram extents
before we can do the main computation (third iteration). I'm not sure I
can see a way around this?
 

> Thinking deeper about these cases, we should probably
> abandone BoxesPerSide and use Columns / Rows instead as a
> bunch of points laying on the same horizontal line would
> require many coluns and a single row... We could calculate
> the width/height factor and use that to split the
> geometry_stats_target*160 in Rows and Columns. We would end
> up with always-near-square histogram cells but I don't see a
> big problem about it. Moreover, we could set a
> minimum-histogram-cell size which would in turn automate the
> extent-enrlargment you were suggesting.

Yes, that is a good point about having lots of points on the same
horizontal line.... another corner case!


Cheers,

Mark.

Re: getrelid && list_nth

Sandro Santilli-2
On Thu, Jun 10, 2004 at 02:04:13PM +0100, Mark Cave-Ayland wrote:

> Hi strk,
>
> > -----Original Message-----
> > From: 'strk' [mailto:[hidden email]]
> > Sent: 10 June 2004 13:24
> > To: Mark Cave-Ayland
> > Cc: [hidden email]
> > Subject: Re: [postgis-devel] Re: getrelid && list_nth
>
> (lots more cut)
>
> > Interesting. Am I right saying this computation would make an
> > extent MUCH smaller for your corner case ?
>
> Well IANAM (I am not a mathematician) but the calculation *should* bias
> results further away from the mean as being more erroneus according to a
> normal distribution. Really need to dig out some of those old maths
> textbooks ;) The test case is easy: load in a real data set and then add
> a couple of really large points like POINT(1.0E20 1.0E20) and do an
> ANALYZE; this will give a really unbalanced histogram which makes really
> bad choices for query plans.
>
> It would be useful to get some data and play in a spreadsheet but I
> haven't got time to do that right now...... The other option would be to
> simply filter out any data above the 2 SD threshold and not use it when
> calculating the extents - but then we will lose a small percentage of
> good data.... although we can increase the limit to 2.5 SDs to make sure
> we lose as little as possible. Methinks some experimentation is required
> :)

I've separated the estimator code into an
estimate_selectivity(BOX *search_box, GEOM_STATS *geomstats) function
for easier maintenance.

There I've handled complete containment and complete miss of the histogram
extent by the search box, which catches my single-point histogram and speeds
up some corner cases on 'normal' datasets by avoiding the histogram scan.
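
The early exits boil down to something like this (simplified sketch;
field names are spelled out here for illustration, not the committed code):

    /* Sketch of the two early exits in estimate_selectivity(). */
    typedef struct
    {
        double xmin, ymin, xmax, ymax;
    } box2d_sketch;

    static float
    selectivity_early_exit(const box2d_sketch *search, const box2d_sketch *histo)
    {
        /* search box entirely misses the histogram extent: nothing matches */
        if (search->xmax < histo->xmin || search->xmin > histo->xmax ||
            search->ymax < histo->ymin || search->ymin > histo->ymax)
            return 0.0f;

        /* search box completely contains the histogram extent: all match */
        if (search->xmin <= histo->xmin && search->xmax >= histo->xmax &&
            search->ymin <= histo->ymin && search->ymax >= histo->ymax)
            return 1.0f;

        return -1.0f;   /* neither case: fall through to the cell scan */
    }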

(cuts)

> > As you can see the first scan could also compute standard
> > deviation and handle 0-extent case.
>
> I'm not sure we can do this.... we need the mean first so that's one
> iteration to find the mean before we can calculate the SD (second
> iteration) .... and we need the SD to calculate the histogram extents
> before we can do the main computation (third iteration). I'm not sure I
> can see a way around this?

Nope, you are right. We can't. I was wrong.

I don't think making three scans is that bad, as we cache bounding boxes,
so no further fetch is required. Memory is occupied anyway ... and we
are vacuuming, so we have time (we provide hooks for interruptions also!).

I would not filter out sampled geometries unless they are invalid
(non-finite geoms should be added to that category, for example). I'll try
your code and make the extent reduction be printed at DEBUG 1.

>  
> > Thinking deeper about these cases, we should probably
> > abandone BoxesPerSide and use Columns / Rows instead as a
> > bunch of points laying on the same horizontal line would
> > require many coluns and a single row... We could calculate
> > the width/height factor and use that to split the
> > geometry_stats_target*160 in Rows and Columns. We would end
> > up with always-near-square histogram cells but I don't see a
> > big problem about it. Moreover, we could set a
> > minimum-histogram-cell size which would in turn automate the
> > extent-enrlargment you were suggesting.
>
> Yes, that is a good point about having lots of points on the same
> horizontal line.... another corner case!

I think this is a general issue: geometries usually have the
same resolution in both directions, so there's no point in squeezing
histogram cells... On the other hand, having near-square histogram
cells would increase estimator precision using the same total number
of cells.

--strk;

standard deviation based histogram extent reduction

Sandro Santilli-2
In reply to this post by Mark Cave-Ayland-2
On Thu, Jun 10, 2004 at 12:36:35PM +0100, Mark Cave-Ayland wrote:

(many cuts)

> The other corner case I have seen was when one of our import scripts
> went wrong and we ended up with several very large x/y coords in our
> table along with our normal data. When we did an ANALYZE, because the
> overall extents were so large, all of the geometries appeared in a
> single box in the bottom left hand corner of histogram and hence
> provided really bad query plans - deleting these erroneus geometries and
> doing ANALYZE solved the problem and everything went back to normal.
>
> The solution I thought of was to try and reduce the effect of outliers
> using some statistical theory: for example we know that for a random
> sample of data, 95% of the data lies within 2 standard deviations of the
> mean. So I was considering calculating the mean of all LLB.x/y and
> URT.x/y, using this to calculate the standard deviation, and hence
> calculate the geomextents from this. This should hopefully reduce the
> effect of excessively large coordinates on the extents of the
> selectivity histogram at the cost of making the histogram area slightly
> smaller than that currently indicated by the overall sample extents.
>
> The pseudo-code example for LLB.x would look something like:
>
> // Calculate mean
> for each LLB.x in sample
> sumLLB.x += LLB.x
> next
> meanLLB.x = sumLLB.x / samplerows
>
> // Calculate standard deviation
> for each LLB.x in sample
> s2 = (LLB.x - meanLLB.x) * (LLB.x - meanLLB.x)
> next
> s2 /= (samplerows - 1)
> s = sqrt(s2)
>
> // For 95% coverage we should lie within 2sds of the mean
> geomstats->xmin = meanLLB.x - 2 * s
>
>
> The downside is that we have to iterate through the sample tuples twice,
> once to calculate the mean and then once again to calculate the standard
> deviation. However, since we are only using a sample of rows from the
> table then hopefully the effect on larger tables will be negligible.
> Comments anyone?

I'm working on your suggested histogram extent reduction algorithm.

I've added an intermediate scan to compute the standard deviation,
and I use that to compute the histogram extent:

        geomstats->xmin = max((avgLOWx - 2 * sdLOWx), sample_extent->LLB.x);
        geomstats->ymin = max((avgLOWy - 2 * sdLOWy), sample_extent->LLB.y);
        geomstats->xmax = min((avgHIGx + 2 * sdHIGx), sample_extent->URT.x);
        geomstats->ymax = min((avgHIGy + 2 * sdHIGy), sample_extent->URT.y);

On the third scan (when histogram cells are given a value), every
feature is checked to see whether it still falls within the modified extent,
so 'hard deviants' are skipped entirely.
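
The skip test itself is just a bbox comparison against the reduced extent,
roughly (sketch, field names illustrative):

    #include <stdbool.h>

    typedef struct
    {
        double xmin, ymin, xmax, ymax;      /* reduced histogram extent */
    } geom_stats_sketch;

    /* A feature is a "hard deviant" when its bbox lies completely outside
     * the reduced histogram extent and must not contribute to any cell. */
    static bool
    is_hard_deviant(double fxmin, double fymin, double fxmax, double fymax,
                    const geom_stats_sketch *gs)
    {
        return fxmax < gs->xmin || fxmin > gs->xmax ||
               fymax < gs->ymin || fymin > gs->ymax;
    }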

The test case is: a bunch of identical points, plus one hard deviant.
Sample rows: 3000.

When the hard deviant is not caught in the sample bucket, this
is what I get:

 NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
 NOTICE:   sample_extent: xmax,ymax: 10.000000,10.000000
 NOTICE:   standard deviations:
 NOTICE:    LOWx: 10.000000 +- 0.000000
 NOTICE:    LOWy: 10.000000 +- 0.000000
 NOTICE:    HIGx: 10.000000 +- 0.000000
 NOTICE:    HIGy: 10.000000 +- 0.000000
 NOTICE:   histo: xmin,ymin: 10.000000,10.000000
 NOTICE:   histo: xmax,ymax: 10.000000,10.000000

When the hard deviant is caught, I get this:

 NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
 NOTICE:   sample_extent: xmax,ymax: 14.000000,11000.000000
 NOTICE:   standard deviations:
 NOTICE:    LOWx - avg:10.001333 sd:0.073030
 NOTICE:    LOWy - avg:13.663333 sd:200.649030
 NOTICE:    HIGx - avg:10.001333 sd:0.073030
 NOTICE:    HIGy - avg:13.663333 sd:200.649030
 NOTICE:   feat 1481 is an hard deviant, skipped
 NOTICE:   histo: xmin,ymin: 10.000000,10.000000
 NOTICE:   histo: xmax,ymax: 10.147392,414.961395


After histogram extent reduction the hard deviant falls completely
outside of it.

I've handled this case in the normalization by skipping these features
and counting only the actually examined ones.

Still, the extent has been enlarged too much!
All examined features fall in the first (upper-left) cell of
the histogram, but this cell is very tall, and in order to
completely overlap it my search box has to be bigger than
actually needed.

Since this histogram is a vertical bar, if we used
a 1600*1 cell grid instead of a 40*40 one the result would have been
more accurate.
Also, if we had re-computed the histogram extent after excluding the
'hard deviant' we would have done a better job, but this could end
up being a recursive operation...

I haven't tested this on production data yet, but I've committed it
so you can test along with me.
To disable it, define USE_STANDARD_DEVIATION to 0 at the top of the file.

The utils/ directory contains a script to test the accuracy of the estimator.

Mark, it's a pleasure working with you :)

--strk;