Query Performance Question

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Query Performance Question

Stephen Woodbridge
Hi all,

I'm working on improving the performance of the new address standardizer
and I'm trying to figure out what is the "secret sauce" to make the
queries fastest.

My function is declared below for reference. Here are the timing on
various queries:

Query 1:
as_test=# select * from as_standardize(
     '123 oak ln, st marie ny usa',
     (select grammar from as_config where countrycode='us'),
     (select lexicon from as_config where countrycode='us'),
     'en_US',
     (select filter from as_config where countrycode='us')
);
Time: 593.482 ms

Query 2:
as_test=# select (as_standardize(
     '123 oak ln, st marie ny usa',
     (select grammar from as_config where countrycode='us'),
     (select lexicon from as_config where countrycode='us'),
     'en_US',
     (select filter from as_config where countrycode='us'))
).*;
Time: 9610.006 ms

The function return 16 columns and 16 * .5935 sec ~= 9.610 sec

Query 3:
as_test=# with cfg as (
    select '123 oak ln, st marie ny usa'::text as addr, *
      from as_config where countrycode='us'
)
select addr, (as_standardize(addr, grammar, clexicon, 'en_US',
filter)).* from cfg;
Time: 2533.146 ms

The real problem is when I want to standardize a whole table like:

Query 4:
as_test=# with cfg as (
    select * from as_config where countrycode='us'
)
select t.id, t.addr, (as_standardize(t.addr, cfg.grammar,
                                 cfg.clexicon, 'en_US', cfg.filter)).*
   from tmp_addresses t, cfg
   order by t.id;
Time: 46537.589 ms

The tmp_addresses table has 53 records so its taking on average 0.878
sec / record.
The as_config table has 25 records in it, but we only use 1.

I've already implemented a query level cache for the grammar and lexicon
objects which are expensive to construct and that improved performance.

I'm having trouble reconciling the various timings. It is possible that
the sum of the timings for the 53 records really is about 46.5 sec but
that does not explain the differences between Query 1, 2, and 3. The
difference between Query 1 and 2 seems obvious.

Is there a better way to standardize a whole table of addresses? These
issues aside, I'm still looking at ways to improve the performance of
the underlying C++ code.

The code can be found here if your interested:
https://github.com/woodbri/address-standardizer
http://imaptools.com:8080/dl/address-standardizer2/ (windows 64-bit)

Thoughts appreciated,
   -Steve

CREATE OR REPLACE FUNCTION as_standardize(
         address text,
         grammar text,
         lexicon text,
         locale text,
         filter text,
         OUT building text,
         OUT house_num text,
         OUT predir text,
         OUT qual text,
         OUT pretype text,
         OUT name text,
         OUT suftype text,
         OUT sufdir text,
         OUT ruralroute text,
         OUT extra text,
         OUT city text,
         OUT prov text,
         OUT country text,
         OUT postcode text,
         OUT box text,
         OUT unit text
         )
     RETURNS RECORD
     AS '$libdir/address_standardizer2-2.0', 'as_standardize'
     LANGUAGE 'c' STABLE STRICT;

I have also tried IMMUTABLE instead of STABLE but I get the same timing
results.

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel
Reply | Threaded
Open this post in threaded view
|

Re: Query Performance Question

Stephen Woodbridge
OK, for the record, I modified the queries to look like the following
and was able to standardize a table of 226 records in about 19 sec or 88
ms per record on average.

-- before PostgreSQL 9.3 use a query like this
with tmp as (select a.id, address, as_standardize(
                  address,
                  grammar,
                  clexicon,
                  'en_US',
                  filter) as std
              from as_config cfg, test_addresses a
             where cfg.countrycode='us')
select id, (std).*, address
   from tmp
  order by id;

-- PostgreSQL 9.3+ use a query like this
select a.id, std.*, address
   from test_addresses as a,
        as_config cfg,
        LATERAL as_standardize(
             address,
             grammar,
             clexicon,
             'en_US',
             filter
         ) as std
  where cfg.countrycode='us'
  order by a.id;

So this is very close to the performance I get running against a command
line tester version of the address standardizer reading the grammar and
lexicon from files.

-Steve

On 10/13/2016 4:27 PM, Stephen Woodbridge wrote:

> Hi all,
>
> I'm working on improving the performance of the new address standardizer
> and I'm trying to figure out what is the "secret sauce" to make the
> queries fastest.
>
> My function is declared below for reference. Here are the timing on
> various queries:
>
> Query 1:
> as_test=# select * from as_standardize(
>     '123 oak ln, st marie ny usa',
>     (select grammar from as_config where countrycode='us'),
>     (select lexicon from as_config where countrycode='us'),
>     'en_US',
>     (select filter from as_config where countrycode='us')
> );
> Time: 593.482 ms
>
> Query 2:
> as_test=# select (as_standardize(
>     '123 oak ln, st marie ny usa',
>     (select grammar from as_config where countrycode='us'),
>     (select lexicon from as_config where countrycode='us'),
>     'en_US',
>     (select filter from as_config where countrycode='us'))
> ).*;
> Time: 9610.006 ms
>
> The function return 16 columns and 16 * .5935 sec ~= 9.610 sec
>
> Query 3:
> as_test=# with cfg as (
>    select '123 oak ln, st marie ny usa'::text as addr, *
>      from as_config where countrycode='us'
> )
> select addr, (as_standardize(addr, grammar, clexicon, 'en_US',
> filter)).* from cfg;
> Time: 2533.146 ms
>
> The real problem is when I want to standardize a whole table like:
>
> Query 4:
> as_test=# with cfg as (
>    select * from as_config where countrycode='us'
> )
> select t.id, t.addr, (as_standardize(t.addr, cfg.grammar,
>                                 cfg.clexicon, 'en_US', cfg.filter)).*
>   from tmp_addresses t, cfg
>   order by t.id;
> Time: 46537.589 ms
>
> The tmp_addresses table has 53 records so its taking on average 0.878
> sec / record.
> The as_config table has 25 records in it, but we only use 1.
>
> I've already implemented a query level cache for the grammar and lexicon
> objects which are expensive to construct and that improved performance.
>
> I'm having trouble reconciling the various timings. It is possible that
> the sum of the timings for the 53 records really is about 46.5 sec but
> that does not explain the differences between Query 1, 2, and 3. The
> difference between Query 1 and 2 seems obvious.
>
> Is there a better way to standardize a whole table of addresses? These
> issues aside, I'm still looking at ways to improve the performance of
> the underlying C++ code.
>
> The code can be found here if your interested:
> https://github.com/woodbri/address-standardizer
> http://imaptools.com:8080/dl/address-standardizer2/ (windows 64-bit)
>
> Thoughts appreciated,
>   -Steve
>
> CREATE OR REPLACE FUNCTION as_standardize(
>         address text,
>         grammar text,
>         lexicon text,
>         locale text,
>         filter text,
>         OUT building text,
>         OUT house_num text,
>         OUT predir text,
>         OUT qual text,
>         OUT pretype text,
>         OUT name text,
>         OUT suftype text,
>         OUT sufdir text,
>         OUT ruralroute text,
>         OUT extra text,
>         OUT city text,
>         OUT prov text,
>         OUT country text,
>         OUT postcode text,
>         OUT box text,
>         OUT unit text
>         )
>     RETURNS RECORD
>     AS '$libdir/address_standardizer2-2.0', 'as_standardize'
>     LANGUAGE 'c' STABLE STRICT;
>
> I have also tried IMMUTABLE instead of STABLE but I get the same timing
> results.
>
> ---
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus
>
> _______________________________________________
> postgis-devel mailing list
> [hidden email]
> http://lists.osgeo.org/mailman/listinfo/postgis-devel


---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
postgis-devel mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/postgis-devel