RE: standard deviation based histogram extent reduction

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

RE: standard deviation based histogram extent reduction

Mark Cave-Ayland-2
Hi strk,

> -----Original Message-----
> From: 'strk' [mailto:[hidden email]]
> Sent: 10 June 2004 16:42
> To: Mark Cave-Ayland
> Cc: [hidden email]
> Subject: standard deviation based histogram extent reduction

(big cut)

> I'm working on your suggested histogram extent reduction algorithm.
>
> I've added an intermediate scan to compute standard
> deviation, and I use that to compute histogram extent:
>
>         geomstats->xmin = max((avgLOWx - 2 * sdLOWx),
> sample_extent->LLB.x);
>         geomstats->ymin = max((avgLOWy - 2 * sdLOWy),
> sample_extent->LLB.y);
>         geomstats->xmax = min((avgHIGx + 2 * sdHIGx),
> sample_extent->URT.x);
>         geomstats->ymax = min((avgHIGy + 2 * sdHIGy),
> sample_extent->URT.y);
>
> On the third scan (when histogram cells are given a value), every
> feature is checked to still fall in the modified extent, thus
> skipping 'hard deviant' completely out.
>
> Test case is: a bunch of equal points, an hard deviant one.
> Samplep rows are: 3000.
>
> When the hard deviant is not cought in the sample bucket,
> this is what I get:
>
>  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
>  NOTICE:   sample_extent: xmax,ymax: 10.000000,10.000000
>  NOTICE:   standard deviations:
>  NOTICE:    LOWx: 10.000000 +- 0.000000
>  NOTICE:    LOWy: 10.000000 +- 0.000000
>  NOTICE:    HIGx: 10.000000 +- 0.000000
>  NOTICE:    HIGy: 10.000000 +- 0.000000
>  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
>  NOTICE:   histo: xmax,ymax: 10.000000,10.000000
>
> When the hard deviant is cought, I get this:
>
>  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
>  NOTICE:   sample_extent: xmax,ymax: 14.000000,11000.000000
>  NOTICE:   standard deviations:
>  NOTICE:    LOWx - avg:10.001333 sd:0.073030
>  NOTICE:    LOWy - avg:13.663333 sd:200.649030
>  NOTICE:    HIGx - avg:10.001333 sd:0.073030
>  NOTICE:    HIGy - avg:13.663333 sd:200.649030
>  NOTICE:   feat 1481 is an hard deviant, skipped

>  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
>  NOTICE:   histo: xmax,ymax: 10.147392,414.961395

Great! So this is the rectangle in which we are 95% sure any data inside
here is valid... so why don't we use this as the cut-off filter for the
histogram bounding box too? It would mean a four pass algorithm but....
:)

So keep passes 1 & 2 as they are at the moment. On pass 3, we calculate
the bounding box of the histogram as the original algorithm did,
*except* we do not include any geometries which lie outside of the
xmin,ymin,xmax,ymax rectangle calculated above as these are assumed to
be noise. Finally on pass 4, we populate the histogram, again ignoring
values which lie outside of this range.

You may even find that we can increase the level to 3 * SDs from the
mean and still manage to cut out the outliers - this puts us at being
somewhere around 99% certain that any data within the bounding rectangle
is valid. Which is nice.

> Still, the extent has been enlarged too much!.
> All examined features fall in the first cell (upper-left) of
> the histogram, but this cell is very tall, and in order to
> completely overlap it My search box has to be bigger then
> actually needed.
>
> Since this histogram is a vertical bar, if we used
> a 1600*1 cells grid instead of a 40*40 the result would have
> been more accurate. Also, if we had re-computed histogram
> extent after the exclusion of the 'hard deviator' we would
> have make a better job, but this could end up being a
> recursive operation ...

Hopefully this will be solved by cutting the outliers out of the data
from the histogram bounding box too (see above).

> I haven't tested this yet on production data, but I've
> committed it, to allow you to test with me. To disable it,
> define USE_STANDARD_DEVIATION to 0 at top file.
>
> The utils/ directory contains a script to test accuracy of
> the estimator.
>
> Mark, it's a pleasure working with you :)

Seriously strk, this is all great stuff! I think that everyone, and not
just myself, appreciates the effort you've put into this. I look forward
to doing some more serious testing on this when I have more time on my
hands :)


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



Reply | Threaded
Open this post in threaded view
|

RE: standard deviation based histogram extent reduction

Sandro Santilli-2
I've added irregular sized histogram grid (cell aspect is always
nearly square keeping total cells near to requested precision).

I'd like to have some test results before working on other
refinements. Just to make sure we are not introducing any bug.

Thanks for you attention.

--strk;

On Thu, Jun 10, 2004 at 05:40:06PM +0100, Mark Cave-Ayland wrote:

> Hi strk,
>
> > -----Original Message-----
> > From: 'strk' [mailto:[hidden email]]
> > Sent: 10 June 2004 16:42
> > To: Mark Cave-Ayland
> > Cc: [hidden email]
> > Subject: standard deviation based histogram extent reduction
>
> (big cut)
>
> > I'm working on your suggested histogram extent reduction algorithm.
> >
> > I've added an intermediate scan to compute standard
> > deviation, and I use that to compute histogram extent:
> >
> >         geomstats->xmin = max((avgLOWx - 2 * sdLOWx),
> > sample_extent->LLB.x);
> >         geomstats->ymin = max((avgLOWy - 2 * sdLOWy),
> > sample_extent->LLB.y);
> >         geomstats->xmax = min((avgHIGx + 2 * sdHIGx),
> > sample_extent->URT.x);
> >         geomstats->ymax = min((avgHIGy + 2 * sdHIGy),
> > sample_extent->URT.y);
> >
> > On the third scan (when histogram cells are given a value), every
> > feature is checked to still fall in the modified extent, thus
> > skipping 'hard deviant' completely out.
> >
> > Test case is: a bunch of equal points, an hard deviant one.
> > Samplep rows are: 3000.
> >
> > When the hard deviant is not cought in the sample bucket,
> > this is what I get:
> >
> >  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   sample_extent: xmax,ymax: 10.000000,10.000000
> >  NOTICE:   standard deviations:
> >  NOTICE:    LOWx: 10.000000 +- 0.000000
> >  NOTICE:    LOWy: 10.000000 +- 0.000000
> >  NOTICE:    HIGx: 10.000000 +- 0.000000
> >  NOTICE:    HIGy: 10.000000 +- 0.000000
> >  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   histo: xmax,ymax: 10.000000,10.000000
> >
> > When the hard deviant is cought, I get this:
> >
> >  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   sample_extent: xmax,ymax: 14.000000,11000.000000
> >  NOTICE:   standard deviations:
> >  NOTICE:    LOWx - avg:10.001333 sd:0.073030
> >  NOTICE:    LOWy - avg:13.663333 sd:200.649030
> >  NOTICE:    HIGx - avg:10.001333 sd:0.073030
> >  NOTICE:    HIGy - avg:13.663333 sd:200.649030
> >  NOTICE:   feat 1481 is an hard deviant, skipped
>
> >  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   histo: xmax,ymax: 10.147392,414.961395
>
> Great! So this is the rectangle in which we are 95% sure any data inside
> here is valid... so why don't we use this as the cut-off filter for the
> histogram bounding box too? It would mean a four pass algorithm but....
> :)
>
> So keep passes 1 & 2 as they are at the moment. On pass 3, we calculate
> the bounding box of the histogram as the original algorithm did,
> *except* we do not include any geometries which lie outside of the
> xmin,ymin,xmax,ymax rectangle calculated above as these are assumed to
> be noise. Finally on pass 4, we populate the histogram, again ignoring
> values which lie outside of this range.
>
> You may even find that we can increase the level to 3 * SDs from the
> mean and still manage to cut out the outliers - this puts us at being
> somewhere around 99% certain that any data within the bounding rectangle
> is valid. Which is nice.
>
> > Still, the extent has been enlarged too much!.
> > All examined features fall in the first cell (upper-left) of
> > the histogram, but this cell is very tall, and in order to
> > completely overlap it My search box has to be bigger then
> > actually needed.
> >
> > Since this histogram is a vertical bar, if we used
> > a 1600*1 cells grid instead of a 40*40 the result would have
> > been more accurate. Also, if we had re-computed histogram
> > extent after the exclusion of the 'hard deviator' we would
> > have make a better job, but this could end up being a
> > recursive operation ...
>
> Hopefully this will be solved by cutting the outliers out of the data
> from the histogram bounding box too (see above).
>
> > I haven't tested this yet on production data, but I've
> > committed it, to allow you to test with me. To disable it,
> > define USE_STANDARD_DEVIATION to 0 at top file.
> >
> > The utils/ directory contains a script to test accuracy of
> > the estimator.
> >
> > Mark, it's a pleasure working with you :)
>
> Seriously strk, this is all great stuff! I think that everyone, and not
> just myself, appreciates the effort you've put into this. I look forward
> to doing some more serious testing on this when I have more time on my
> hands :)
>
>
> Cheers,
>
> Mark.
>
> ---
>
> Mark Cave-Ayland
> Webbased Ltd.
> Tamar Science Park
> Derriford
> Plymouth
> PL6 8BX
> England
>
> Tel: +44 (0)1752 764445
> Fax: +44 (0)1752 764446
>
>
> This email and any attachments are confidential to the intended
> recipient and may also be privileged. If you are not the intended
> recipient please delete it from your system and notify the sender. You
> should not copy it or use it for any purpose nor disclose or distribute
> its contents to any other person.
>
>
> _______________________________________________
> postgis-devel mailing list
> [hidden email]
> http://postgis.refractions.net/mailman/listinfo/postgis-devel