How to use PL/pgSQL to transform an original table to desired two tables in postgresql?

Question

Say we have this table things_happened with schema looks like this:

CREATE TABLE things_happened
(
  zipcode character varying(10),
  city character varying(50),
  state character varying(2),
  metro character varying(50),
  countyname character varying(50),
  "1996-04" integer,
  "1996-05" integer,
  "1996-06" integer,
  "1996-07" integer,
  "1996-08" integer,
  "1996-09" integer,
  ...
  "2014-09" integer,
  "2014-10" integer,
  "2014-11" integer
)

It looks funny because the data is imported from csv files by other folks.

It's obvious that this table is not efficient, the values of lots of months are empty for a particular area. So I was thinking creating two tables out of it.

The schemas of desired two tables are:

area_info (zipcode, city, state, metro, countyname) with zipcode as primary key
things_happened_per_month (year, month, zipcode, times) with year, month, zipcode as primary key

Because the size of the table is big and the data keeps coming and the column name has to become a parameter, I was wondering how to use "PL/pgSQL - SQL Procedural Language" to do it? Or any other efficient solution?

Patrick · Accepted Answer

Your table things_happened looks like a pivot table and you want to normalize that to a more efficient data structure. You will have to write a PlPgSQL function to do that.

Since you have many months, and possibly more columns added for later months, I suggest you dynamically determine the month columns in the table and then loop over the results. In the example below I assume that you have already copied the area_info data into its own table; I focus here on the "times" column in the thpm table (which I assume you have already created).

Dynamic solution

The below function uses a dynamic look-up of the YYYY-DD columns in the table and then loops over records and columns to get the data into the normalized table. (Many thanks to Pavel Stehule for pointing out the last finicky error in the code.)

CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
  col_names text[];
  period    text;
  th        things_happened%rowtype;
  times     integer;
BEGIN
  -- Get the currently present columns from the catalog
  SELECT array_agg(attname::text) INTO col_names
  FROM pg_attribute att
  JOIN pg_class c ON c.oid = att.attrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relname = 'things_happened'
    AND n.nspname = 'public'
    AND position('-' in attname) = 5; -- only "times" columns

  -- Loop over all the rows in the things_happened table
  FOR th IN SELECT * FROM things_happened LOOP
    -- Now loop over column names
    FOREACH period IN ARRAY col_names LOOP
      -- Fudge the proper column from the th record into a local variable
      EXECUTE 'SELECT $1.' || quote_ident(period) INTO times USING th;

      -- If times is a proper value, insert it into the thpm table
      IF times IS NOT NULL THEN
        INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES
          (substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);
      END IF;
    END LOOP;
  END LOOP;
END; $$ LANGUAGE plpgsql;

This should work as a one-off exercise. If the original table keeps on getting new data, you should run this function periodically and then do an UPSERT in the innermost EXECUTE: first try to do an UPDATE of the "times" value, if that fails because there is no data for the year, month, zipcode combination, then do an INSERT. See many other questions here on SO for examples.

Static solution

The below function is the non-dynamic variant. You will have to put separate blocks of code for every month in the things_happened table.

CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
  th        things_happened%rowtype;
  times     integer;
BEGIN
  -- Loop over all the rows in the things_happened table
  FOR th IN SELECT * FROM things_happened LOOP
    -- Copy the below block for 1996, April, for all other months.
    SELECT th."1996-04" INTO times;
    IF times IS NOT NULL THEN
      INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 4, th.zipcode, times);
    -- 1996, May
    SELECT th."1996-05" INTO times;
    IF times IS NOT NULL THEN
      INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 5, th.zipcode, times);
    END IF;
    -- Etc.
  END LOOP;
END; $$ LANGUAGE plpgsql;

Ugly, but functional.

How to use PL/pgSQL to transform an original table to desired two tables in postgresql?

Answers (1)

Dynamic solution

Static solution

Related Questions