Reputation: 5611
I am trying to produce a table containing counts of non-null data points for the columns in the "Area Health Resource File" -- which contains per-county demographic and health data.
I have reworked the data into time series from the provided format, resulting in a bunch of tables named "series_foo" for each data category foo, with rows identified by county FIPS and year (initial and final, for multi-year surveys).
Now I want to produce counts over the time-series columns. So far the query I have is:
do language plpgsql $$
declare
  query text;
begin
  query := (
    -- one row per category, with the backing table's name
    with cats as (
      select tcategory, format('series_%s', tcategory) series_tbl
      from series_categories),
    -- every data column of those tables, taken from the catalog
    cols as (
      select tcategory, series_tbl, attname col
      from pg_attribute a
      join pg_class r on a.attrelid = r.oid
      join cats c on c.series_tbl = r.relname
      where attname not in ('FIPS', 'initial', 'final')
        and attnum >= 0
        and not a.attisdropped   -- skip dropped columns
      order by tcategory, col),
    -- a count() expression for each column
    scols as (
      select tcategory, series_tbl, col,
             format('count(%s)', quote_ident(col)) sel
      from cols),
    -- one single-count subquery per (category, column) pair
    sel as (
      select format(
               E'  (select %s tcategory, %s col, %s from %s)\n',
               quote_literal(tcategory), quote_literal(col), sel, series_tbl) q
      from scols)
    select string_agg(q, E'union\n') from sel);
  -- SELECT INTO inside dynamic SQL creates the result table
  execute format(
    'select * into category_column_counts from (%s) x', query);
end;
$$;
(Here the "series_categories" table has category name.)
This ... "works" but is probably hundreds of times too slow. Its doing ~10,000
individual tablescans, which could be reduced 500-fold, as there are only 20-ish
categories. I would like to use select count(col1), count(col2) ...
for each table, then "unnest" these row records and concatenate all together.
I haven't figured it out though. I looked at: https://stackoverflow.com/a/14087244/435563
for inspiration, but haven't transformed that successfully.
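Roughly, I want the generated SQL to do one scan per table and then pivot the counted row back into (column, count) rows. A sketch of the shape I'm after, for a hypothetical table series_foo with made-up columns col1 and col2:

  -- sketch only: series_foo, col1, col2 are invented names
  select 'foo' tcategory, v.col, v.count
  from (select count(col1) col1, count(col2) col2
        from series_foo) t,                       -- one scan, all counts
       lateral (values ('col1', t.col1),
                       ('col2', t.col2)) v(col, count);  -- pivot to rows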
Upvotes: 0
Views: 119
Reputation: 32276
I don't know the AHRF format (I looked up the web site, but there are too many cute nurse pictures for me to focus on the content...), but you are probably going about it the wrong way: first extracting the data into multiple tables and then trying to piece them back together again. Instead, you should use a design pattern called Entity-Attribute-Value, which stores all the data values in a single table with a category identifier and a "feature" identifier, with table structures somewhat like this:
CREATE TABLE categories (
id serial PRIMARY KEY,
category text NOT NULL,
... -- other attributes like min/max allowable values, measurement technique, etc.
);
CREATE TABLE feature ( -- town, county, state, whatever
id serial PRIMARY KEY,
fips varchar NOT NULL,
name varchar,
... -- other attributes
);
CREATE TABLE measurement (
feature integer REFERENCES feature,
category integer REFERENCES categories,
dt date,
value double precision NOT NULL,
PRIMARY KEY (feature, category, dt)
);
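To make the pattern concrete, here is how a hypothetical sample (invented category names, county, and values; the integer ids come from the serial primary keys) would be stored:

INSERT INTO categories (category) VALUES ('obesity'), ('substance_abuse');
INSERT INTO feature (fips, name) VALUES ('01001', 'Autauga County');
-- one row per (feature, category, date) observation
INSERT INTO measurement (feature, category, dt, value)
VALUES (1, 1, '2014-01-01', 0.31),
       (1, 2, '2014-01-01', 0.07);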
This design pattern is very flexible. For instance, you can store 50 categories for some rows of one feature class and only 5 for another set of rows. You can store data from multiple observations on different dates or years. You can have multiple "feature" tables with separate "measurement" tables, or you can set it up with table inheritance (sketched below).
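The inheritance variant is a one-liner per feature class (a sketch; note that PRIMARY KEY and REFERENCES constraints are not inherited, so they must be redeclared on each child):

CREATE TABLE county_measurement () INHERITS (measurement);
CREATE TABLE state_measurement  () INHERITS (measurement);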
Answering your query is then very straightforward using standard PK-FK relationships. More to the point, answering any query is far easier than with your current structure of divide-but-not-conquer.
I don't know exactly how your "initial year"/"final year" data works, but otherwise your requirement would be met by a simple query like so:
SELECT f.fips, c.category, count(*)
FROM feature f -- replace feature by whatever real table you create, like "county"
JOIN measurement m ON m.feature = f.id
JOIN categories c ON c.id = m.category
GROUP BY f.fips, c.category;
Do you want to know dental decay as a function of smoking, alcohol consumption versus psychiatric help, the correlation between obesity and substance abuse, or trends in toddler development? All fairly easy with the above structure, all a slow, painful slog with multiple tables.
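For instance, the obesity/substance-abuse correlation (using the hypothetical category names from the sample above and the built-in corr() aggregate) is just a self-join:

SELECT corr(o.value, s.value)
FROM measurement o
JOIN categories co ON co.id = o.category AND co.category = 'obesity'
JOIN measurement s ON s.feature = o.feature AND s.dt = o.dt
JOIN categories cs ON cs.id = s.category AND cs.category = 'substance_abuse';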
Upvotes: 1
Reputation: 5611
Here is the optimization I found: it uses json_each(row_to_json(c)) to turn a record into a sequence of individual key/value pairs.
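The trick in isolation (a standalone sketch with made-up counts and type names):

  -- pack each column's (count, type) into an anonymous row; row_to_json
  -- names the row fields f1/f2, and json_each yields one (key, value)
  -- pair per column
  select d.key column_name,
         (d.value->>'f1')::int count,
         d.value->>'f2' type_name
  from (select row(42, 'int4') col1, row(7, 'text') col2) c,
       json_each(row_to_json(c)) d;

Applied to the generated query, the full version is: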
do language plpgsql $$
declare
  query text;
begin
  query := (
    with cats as (
      select tcategory, table_name
      from series_category_tables),
    -- every data column of those tables, with its type name
    cols as (
      select tcategory, table_name, attname col, typname type_name
      from pg_attribute a
      join pg_class r on a.attrelid = r.oid
      join cats c on c.table_name = r.relname
      join pg_type t on t.oid = a.atttypid
      where attname not in ('FIPS', 'initial', 'final')
        and attnum >= 0
        and not a.attisdropped   -- skip dropped columns
      order by tcategory, col),
    -- one subquery per table: a single scan computes every count,
    -- each packed as row(count, type), then unnested with json_each
    sel as (
      select format(
               E'  (select %s tcategory, %s table_name,\n'
               || E'     d.key column_name, d.value->>''f2'' type_name, '
               || E'(d.value->>''f1'')::int count\n'
               || E'   from (\n'
               || E'     select (json_each(row_to_json(c))).* from (select\n'
               || E'       %s\n'
               || E'     from %s) c) d)\n',
               quote_literal(tcategory),
               quote_literal(table_name),
               string_agg(
                 format(
                   'row(count(%1$s), %2$s) %1$s',
                   quote_ident(col), quote_literal(type_name)),
                 E',\n'),
               quote_ident(table_name)) selstr
      from cols
      group by tcategory, table_name),
    -- concatenate the per-table subqueries with union
    selu as (
      select string_agg(selstr, E'union\n') selu
      from sel)
    select * from selu);
  drop table if exists category_columns;
  create table category_columns (
    tcategory text, table_name text,
    column_name text, type_name text, count int);
  execute format(
    'insert into category_columns select * from (%s) x', query);
end;
$$;
It runs in ~45 seconds, vs. 6 minutes for the previous version. Can I (or you) do better than this?
Upvotes: 0