Andy N

Reputation: 1304

SQL performance: Inserting one table into two

During my career, I've come across many instances of having to insert flat, denormalised data into a normalised structure.

To accomplish this, I've often used CTE inserts. For example:

CREATE TABLE raw_data (
    foo varchar,
    bar_1 varchar,
    bar_2 varchar
);

INSERT INTO raw_data VALUES ('A', 'A1', 'A2');
INSERT INTO raw_data VALUES ('B', 'B1', 'B2');

CREATE TABLE foo (
    id int PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    value varchar NOT NULL
);

CREATE TABLE bar (
    id int PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    value varchar NOT NULL,
    foo_id int NOT NULL,
    CONSTRAINT fk_bar_foo FOREIGN KEY (foo_id) REFERENCES foo(id)
);

WITH new_foos AS (
    INSERT INTO foo (value)
    SELECT foo FROM raw_data
    RETURNING *
)
INSERT INTO bar (foo_id, value)
SELECT
    f.id,
    unnest(ARRAY[r.bar_1, r.bar_2])
FROM new_foos f
JOIN raw_data r
    ON r.foo = f.value;

It works fine. However, from a performance point of view, it seems a shame to have to go back and re-scan the raw_data table: once for the insert into foo and again for the insert into bar.

I'd be interested in knowing if this is an optimal approach or, if not, what can be done to improve it.

Upvotes: 1

Views: 45

Answers (2)

Belayer

Reputation: 14861

Well, yeah, but thinking about it: if you have sufficient memory to hold the JSON, then you have sufficient memory to hold the table, so passing the data twice may even be faster. One pass from disk, and one pass from memory. A DBMS tends to retain the most recently used data in memory for exactly this reason. Disclaimer: my main experience is with Oracle, so I may be projecting onto Postgres here, but I think it does this buffering.
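If you want to verify that in Postgres, one option is to run the statement under EXPLAIN (ANALYZE, BUFFERS) and look at the Buffers lines for the two scans of raw_data: "shared hit" pages came from the buffer cache, "read" pages came from disk. Bear in mind that ANALYZE actually executes the insert, so a sketch like this, wrapped in a transaction you roll back, would do (using the tables from the question):

BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
WITH new_foos AS (
    INSERT INTO foo (value)
    SELECT foo FROM raw_data
    RETURNING *
)
INSERT INTO bar (foo_id, value)
SELECT
    f.id,
    unnest(ARRAY[r.bar_1, r.bar_2])
FROM new_foos f
JOIN raw_data r
    ON r.foo = f.value;

ROLLBACK;

The scan inside the CTE and the scan feeding the join each report their own buffer counts, so you can see how much of the second pass was served from memory.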

Upvotes: 1

Andy N

Reputation: 1304

I've been playing about with this for a bit, and I want to throw this out there as a possible suggestion. What if you used a CTE to put the data into a JSON structure (generating the PKs by hand), then inserted into each of your tables from that?

Like this:

WITH raw_as_json AS (
    SELECT
        jsonb_build_object(
            'id', nextval(pg_get_serial_sequence('foo', 'id')),
            'value', foo,
            'bars', jsonb_build_array(bar_1, bar_2)
        ) AS foobar
    FROM raw_data
), foos AS (
    INSERT INTO foo (id, value)
    OVERRIDING SYSTEM VALUE  -- needed because foo.id is GENERATED ALWAYS
    SELECT
        (foobar ->> 'id')::int,
        foobar ->> 'value'
    FROM raw_as_json
)
INSERT INTO bar (foo_id, value)  -- bar.id can just use its identity default
SELECT
    (foobar ->> 'id')::int,
    jsonb_array_elements_text(foobar -> 'bars')
FROM raw_as_json;

It only has to scan the raw data once. I haven't fully weighed up the performance concerns of the CTE, the JSON handling, etc. I'd appreciate any criticism of this approach, and I'll still hold out for a better (or less weird) answer.
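For comparison, here is a minimal sketch of the same "generate the PKs by hand" idea with the JSON layer stripped out: pre-allocate the foo ids from the identity column's sequence in one CTE and feed both inserts from it. Because that CTE calls a volatile function (nextval), Postgres materialises it rather than inlining it, so each row gets exactly one id even though the CTE is referenced twice. This assumes the foo/bar tables defined in the question:

WITH numbered AS (
    -- Pre-generate the foo primary keys so both inserts can share them.
    SELECT
        nextval(pg_get_serial_sequence('foo', 'id')) AS foo_id,
        foo,
        bar_1,
        bar_2
    FROM raw_data
), new_foos AS (
    INSERT INTO foo (id, value)
    OVERRIDING SYSTEM VALUE  -- needed because foo.id is GENERATED ALWAYS
    SELECT foo_id, foo
    FROM numbered
)
INSERT INTO bar (foo_id, value)
SELECT
    foo_id,
    unnest(ARRAY[bar_1, bar_2])
FROM numbered;

Whether either single-scan variant actually beats two passes over raw_data is something I'd still want to measure rather than assume.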

Upvotes: 0
