Reputation: 2641
Say we have this table things_happened with schema looks like this:
CREATE TABLE things_happened
(
zipcode character varying(10),
city character varying(50),
state character varying(2),
metro character varying(50),
countyname character varying(50),
"1996-04" integer,
"1996-05" integer,
"1996-06" integer,
"1996-07" integer,
"1996-08" integer,
"1996-09" integer,
...
"2014-09" integer,
"2014-10" integer,
"2014-11" integer
)
It looks funny because the data is imported from csv files by other folks.
It's obvious that this table is not efficient, the values of lots of months are empty for a particular area. So I was thinking creating two tables out of it.
The schemas of desired two tables are:
area_info (zipcode, city, state, metro, countyname) with zipcode as primary key
things_happened_per_month (year, month, zipcode, times) with year, month, zipcode as primary key
Because the size of the table is big and the data keeps coming and the column name has to become a parameter, I was wondering how to use "PL/pgSQL - SQL Procedural Language" to do it? Or any other efficient solution?
Upvotes: 0
Views: 129
Reputation: 32274
Your table things_happened looks like a pivot table and you want to normalize that to a more efficient data structure. You will have to write a PlPgSQL function to do that.
Since you have many months, and possibly more columns added for later months, I suggest you dynamically determine the month columns in the table and then loop over the results. In the example below I assume that you have already copied the area_info data into its own table; I focus here on the "times" column in the thpm table (which I assume you have already created).
The below function uses a dynamic look-up of the YYYY-DD columns in the table and then loops over records and columns to get the data into the normalized table. (Many thanks to Pavel Stehule for pointing out the last finicky error in the code.)
CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
col_names text[];
period text;
th things_happened%rowtype;
times integer;
BEGIN
-- Get the currently present columns from the catalog
SELECT array_agg(attname::text) INTO col_names
FROM pg_attribute att
JOIN pg_class c ON c.oid = att.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'things_happened'
AND n.nspname = 'public'
AND position('-' in attname) = 5; -- only "times" columns
-- Loop over all the rows in the things_happened table
FOR th IN SELECT * FROM things_happened LOOP
-- Now loop over column names
FOREACH period IN ARRAY col_names LOOP
-- Fudge the proper column from the th record into a local variable
EXECUTE 'SELECT $1.' || quote_ident(period) INTO times USING th;
-- If times is a proper value, insert it into the thpm table
IF times IS NOT NULL THEN
INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES
(substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);
END IF;
END LOOP;
END LOOP;
END; $$ LANGUAGE plpgsql;
This should work as a one-off exercise. If the original table keeps on getting new data, you should run this function periodically and then do an UPSERT
in the innermost EXECUTE
: first try to do an UPDATE
of the "times" value, if that fails because there is no data for the year, month, zipcode combination, then do an INSERT
. See many other questions here on SO for examples.
The below function is the non-dynamic variant. You will have to put separate blocks of code for every month in the things_happened table.
CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
th things_happened%rowtype;
times integer;
BEGIN
-- Loop over all the rows in the things_happened table
FOR th IN SELECT * FROM things_happened LOOP
-- Copy the below block for 1996, April, for all other months.
SELECT th."1996-04" INTO times;
IF times IS NOT NULL THEN
INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 4, th.zipcode, times);
-- 1996, May
SELECT th."1996-05" INTO times;
IF times IS NOT NULL THEN
INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 5, th.zipcode, times);
END IF;
-- Etc.
END LOOP;
END; $$ LANGUAGE plpgsql;
Ugly, but functional.
Upvotes: 1