shaunc

Reputation: 5611

How to create a Postgres query to generate counts of columns where the tables are specified as data

I am trying to produce a table containing counts of non-null datapoints for columns in the "Area Health Resource File" -- which contains per-county demographic and health data.

I have reworked the data into timeseries from the provided format, resulting in a set of tables named "series_<foo>" for each data category foo, with rows identified by county FIPS and year (initial and final for multiyear surveys).

Now I want to produce counts over the timeseries columns. So far, the query I have is:

do language plpgsql $$
declare
  query text;
begin
  query := (with cats as (
    -- one row per category, with the name of its series table
    select tcategory, format('series_%s', tcategory) series_tbl
    from series_categories),
  cols as (
    -- every timeseries column of every series table, from the catalog
    select tcategory, series_tbl, attname col
    from pg_attribute a join pg_class r on a.attrelid = r.oid
      join cats c on c.series_tbl = r.relname
    where attname not in ('FIPS', 'initial', 'final')
    and attnum >= 0
    order by tcategory, col),
  scols as (
    -- a count(...) expression for each column
    select tcategory, series_tbl, col,
    format('count(%s)', quote_ident(col)) sel
    from cols),
  sel as (
    -- one single-column subselect per (category, column) pair
    select format(
      E'  (select %s tcategory, %s col, %s from %s)\n', 
        quote_literal(tcategory), quote_literal(col), sel, series_tbl) q
   from scols)
  select string_agg(q, E'union\n') from sel);
  execute format(
    'select * into category_column_counts from (%s) x', query);
end;
$$;

(Here the "series_categories" table holds the category names.) This ... "works", but is probably hundreds of times too slow: it does ~10,000 individual table scans, which could be reduced 500-fold, as there are only 20-odd categories. I would like to use select count(col1), count(col2) ... once per table, then "unnest" the resulting row records and concatenate them all together.
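For a single table, the shape I am after would be roughly this (table and column names here are hypothetical; the hard part is generating it dynamically for every series table):

```sql
-- One scan of the table produces all the column counts at once;
-- count(col) counts only the non-null values. A lateral VALUES list
-- then "unnests" the one result row back into one row per column.
select 'population' as tcategory, u.col, u.cnt
from (
  select count(col1) as c1, count(col2) as c2
  from series_population
) s,
lateral (values ('col1', s.c1), ('col2', s.c2)) as u(col, cnt);
```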

I haven't figured it out, though. I looked at https://stackoverflow.com/a/14087244/435563 for inspiration, but haven't managed to transform that approach successfully.

Upvotes: 0

Views: 119

Answers (2)

Patrick

Reputation: 32276

I don't know the AHRF format (I looked up the web site but there are too many cute nurse pictures for me to focus on the content...), but you are probably going about this the wrong way by first extracting the data into multiple tables and then trying to piece it back together again. Instead, you should use a design pattern called Entity-Attribute-Value (EAV), which stores all the data values in a single table with a category identifier and a "feature" identifier, with table structures somewhat like this:

CREATE TABLE categories (
  id          serial PRIMARY KEY,
  category    text NOT NULL,
  ... -- other attributes like min/max allowable values, measurement technique, etc.
);

CREATE TABLE feature ( -- town, county, state, whatever
  id          serial PRIMARY KEY,
  fips        varchar NOT NULL,
  name        varchar,
  ... -- other attributes
);

CREATE TABLE measurement (
  feature     integer REFERENCES feature,
  category    integer REFERENCES categories,
  dt          date,
  value       double precision NOT NULL,
  PRIMARY KEY (feature, category, dt)
);

This design pattern is very flexible. For instance, you can store 50 categories for some rows of one feature class and only 5 for another set of rows. You can store data from multiple observations on different dates or years. You can have multiple "feature" tables with separate "measurement" tables, or you can set it up with table inheritance.

Answering your query is then very straightforward using standard PK-FK relationships. More to the point, answering any query is far easier than with your current structure of divide-but-not-conquer.

I don't know exactly how your "initial year"/"final year" data works, but otherwise your requirement would be met by a simple query like so:

SELECT f.fips, c.category, count(*) 
FROM feature f -- replace feature by whatever real table you create, like "county"
JOIN measurement m ON m.feature = f.id
JOIN categories c ON c.id = m.category
GROUP BY f.fips, c.category;

Do you want to know dental decay as a function of smoking, alcohol consumption versus psychiatric help, correlation between obesity and substance abuse, trends in toddler development? All fairly easy with the above structure, all a slow, painful slog with multiple tables.
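For instance, the obesity/substance-abuse correlation becomes a single self-join on the measurement table (category names here are hypothetical), using Postgres's built-in corr() aggregate:

```sql
-- Pearson correlation between two categories, pairing the observations
-- made on the same feature at the same date.
SELECT corr(m1.value, m2.value) AS obesity_vs_substance_abuse
FROM measurement m1
JOIN measurement m2 ON m2.feature = m1.feature AND m2.dt = m1.dt
JOIN categories c1  ON c1.id = m1.category AND c1.category = 'obesity'
JOIN categories c2  ON c2.id = m2.category AND c2.category = 'substance_abuse';
```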

Upvotes: 1

shaunc

Reputation: 5611

Here is the optimization I found: it uses json_each(row_to_json(c)) to turn records into sequences of individual values.
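In its minimal form, on a throwaway table, the trick looks like this: each count is wrapped in an anonymous row(count, typename) composite, row_to_json turns the one-row result into a JSON object (composites serialize with keys f1, f2), and json_each emits one (key, value) pair per original column.

```sql
-- Should yield one row per counted column:
--   ('a', 2, 'int4') and ('b', 2, 'text')
select d.key                  as column_name,
       (d.value->>'f1')::int  as count,
       d.value->>'f2'         as type_name
from (
  select (json_each(row_to_json(c))).*
  from (
    select row(count(a), 'int4') a, row(count(b), 'text') b
    from (values (1, 'x'), (2, null), (null, 'y')) t(a, b)
  ) c
) d;
```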

do language plpgsql $$
declare
  query text;
begin
  query := (with cats as (
    -- one row per category, with the name of its series table
    select tcategory, table_name
    from series_category_tables),
  cols as (
    -- every timeseries column of every series table, plus its type name
    select tcategory, table_name, attname col, typname type_name
    from pg_attribute a join pg_class r on a.attrelid = r.oid
      join cats c on c.table_name = r.relname
      join pg_type t on t.oid = a.atttypid
    where attname not in ('FIPS', 'initial', 'final')
    and attnum >= 0
    order by tcategory, col),
  -- individual "count" fields
  sel as (
    select 
      format(
        E'  (select %s tcategory, %s table_name, \n'
        || E'    d.key column_name, d.value->>''f2'' type_name, '
        || E'(d.value->>''f1'')::int count\n'
        || E'  from (\n'
        || E'    select (json_each(row_to_json(c))).* from (select\n' 
        || E'      %s \n'
        || E'    from %s) c) d)\n',
      quote_literal(tcategory),
      quote_literal(table_name),
      string_agg(
        format(
          '      row(count(%1$s), %2$s) %1$s',
          quote_ident(col), quote_literal(type_name)), 
        E',\n'), quote_ident(table_name)) selstr
    from cols
    group by tcategory, table_name),
  selu as (
    select 
      string_agg(selstr, E'union\n') selu
    from sel)
  select * from selu);
  drop table if exists category_columns;
  create table category_columns (
    tcategory text, table_name text, 
    column_name text, type_name text, count int);
  execute format(
    'insert into category_columns select * from (%s) x', query);

end;
$$;

It runs in ~45 seconds vs 6 minutes for the previous version. Can I/you do better than this?

Upvotes: 0
