Andy N

Reputation: 1304

SQL performance: Inserting one table into two

During my career, I've come across many instances of having to insert flat, denormalised data into a normalised structure.

To accomplish this, I've often used CTE inserts. For example:

CREATE TABLE raw_data (
    foo varchar,
    bar_1 varchar,
    bar_2 varchar
);

INSERT INTO raw_data VALUES ('A', 'A1', 'A2');
INSERT INTO raw_data VALUES ('B', 'B1', 'B2');

CREATE TABLE foo (
    id int PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    value varchar NOT NULL
);

CREATE TABLE bar (
    id int PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    value varchar NOT NULL,
    foo_id int NOT NULL,
    CONSTRAINT fk_bar_foo FOREIGN KEY (foo_id) REFERENCES foo(id)
);

WITH new_foos AS (
    INSERT INTO foo (value)
    SELECT foo FROM raw_data
    RETURNING *
)
INSERT INTO bar (foo_id, value)
SELECT
    f.id,
    unnest(ARRAY[r.bar_1, r.bar_2])
FROM new_foos f
JOIN raw_data r
    ON r.foo = f.value;

It works fine. However, from a performance point of view, it seems a shame to have to go back and re-scan the raw_data table: once for the insert into foo and again for the insert into bar.

I'd be interested in knowing if this is an optimal approach or, if not, what can be done to improve it.

Upvotes: 1

Views: 45

Answers (2)

Belayer

Reputation: 14861

Well, yeah, but thinking about it: if you have sufficient memory to hold the JSON, then you have sufficient memory to hold the table, so passing the data twice may even be faster. One pass from disk, and one pass from memory. A DBMS tends to retain the most recently used data in memory for exactly this reason. Disclaimer: my main experience is with Oracle, so I may be projecting onto Postgres here, but I think it does this buffering.
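If you want to verify that in Postgres, one option is to run the statement under EXPLAIN (ANALYZE, BUFFERS) and look at the Buffers lines for the two scans of raw_data: "shared hit" pages came from the buffer cache, "read" pages came from disk. Bear in mind that ANALYZE actually executes the insert, so a sketch like this, wrapped in a transaction you roll back, would do (using the tables from the question):

BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
WITH new_foos AS (
    INSERT INTO foo (value)
    SELECT foo FROM raw_data
    RETURNING *
)
INSERT INTO bar (foo_id, value)
SELECT
    f.id,
    unnest(ARRAY[r.bar_1, r.bar_2])
FROM new_foos f
JOIN raw_data r
    ON r.foo = f.value;

ROLLBACK;

The scan inside the CTE and the scan feeding the join each report their own buffer counts, so you can see how much of the second pass was served from memory.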

Upvotes: 1

Andy N

Reputation: 1304

I've been playing about with this for a bit, and I want to throw this out there as a possible suggestion. What if you used a CTE to put the data into a JSON structure (generating the PKs by hand), then inserted into each of your tables from that?

Like this:

WITH raw_as_json AS (
    SELECT
        jsonb_build_object(
            'id', nextval(pg_get_serial_sequence('foo', 'id')),
            'value', foo,
            'bars', jsonb_build_array(bar_1, bar_2)
        ) AS foobar
    FROM raw_data
), foos AS (
    INSERT INTO foo (id, value)
    OVERRIDING SYSTEM VALUE  -- needed because foo.id is GENERATED ALWAYS
    SELECT
        (foobar ->> 'id')::int,
        foobar ->> 'value'
    FROM raw_as_json
)
INSERT INTO bar (foo_id, value)  -- bar.id can just use its identity default
SELECT
    (foobar ->> 'id')::int,
    jsonb_array_elements_text(foobar -> 'bars')
FROM raw_as_json;

It only has to scan the raw data once. I haven't fully weighed up the performance concerns of the CTE, the JSON handling, etc. I'd appreciate any criticism of this approach, and I'll still hold out for a better (or less weird) answer.
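For comparison, here is a minimal sketch of the same "generate the PKs by hand" idea with the JSON layer stripped out: pre-allocate the foo ids from the identity column's sequence in one CTE and feed both inserts from it. Because that CTE calls a volatile function (nextval), Postgres materialises it rather than inlining it, so each row gets exactly one id even though the CTE is referenced twice. This assumes the foo/bar tables defined in the question:

WITH numbered AS (
    -- Pre-generate the foo primary keys so both inserts can share them.
    SELECT
        nextval(pg_get_serial_sequence('foo', 'id')) AS foo_id,
        foo,
        bar_1,
        bar_2
    FROM raw_data
), new_foos AS (
    INSERT INTO foo (id, value)
    OVERRIDING SYSTEM VALUE  -- needed because foo.id is GENERATED ALWAYS
    SELECT foo_id, foo
    FROM numbered
)
INSERT INTO bar (foo_id, value)
SELECT
    foo_id,
    unnest(ARRAY[bar_1, bar_2])
FROM numbered;

Whether either single-scan variant actually beats two passes over raw_data is something I'd still want to measure rather than assume.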

Upvotes: 0
