Renato Dinhani

Reputation: 36726

What is the fastest way to insert data into an N:M table in PostgreSQL?

I have an application that needs to insert a lot of rows into the database, but insertion is currently too slow. I want to know the best techniques to speed it up as much as possible.

The situation is that for each insertion I need to discover the id of the token, so I put a SELECT inside the INSERT.

What is the best solution, in both SQL and Java, for making the inserts into the textBlockHasToken table as fast as possible?

Edit: It works at the moment, but the data volume is large. Some numbers: 100,000 distinct rows in the token table and 2 million rows in textBlockHasToken.

I have these three tables and two insert SQLs:

-- Text blocks of the pages
CREATE TABLE textBlock(
    id                 INTEGER    PRIMARY KEY,
    pageId             INTEGER    REFERENCES page,
    text               VARCHAR    NOT NULL, 
    position           INTEGER    NOT NULL  
);

-- Tokens in the text blocks
CREATE TABLE token(
    id                 INTEGER    PRIMARY KEY,  
    text               VARCHAR    NOT NULL, 
    charType           VARCHAR    NOT NULL,
    grammar            VARCHAR    NOT NULL,
    category           VARCHAR[]  NOT NULL, 
    stopWord           BOOLEAN    NOT NULL    DEFAULT false,
    UNIQUE(text)
);

-- N:N relationship between textblock and token 
CREATE TABLE textBlockHasToken(
    textBlockId        INTEGER    REFERENCES textBlock NOT NULL,
    tokenId            INTEGER    REFERENCES token NOT NULL,
    sentence           INTEGER    NOT NULL,
    position           INTEGER    NOT NULL
);

-- Insert the token
INSERT INTO token(id,text,charType,grammar,category)
VALUES(nextval('tokenIdSqc'),?,?,?,?);

-- Insert into the N:M relationship
INSERT INTO textBlockHasToken(textblockId,tokenId,sentence,position)
VALUES(?,(SELECT id FROM token WHERE text = ?),?,?);

Upvotes: 1

Views: 1164

Answers (4)

wildplasser

Reputation: 44250

  • don't use reserved words as identifiers (text, date)
  • (please don't use MixedCaseIdentifiers)
  • your table definition DOES NOT WORK. The one below does work.

-- Text blocks of the pages

DROP TABLE tmp.textblock CASCADE;
CREATE TABLE tmp.textblock
    ( id                 INTEGER    PRIMARY KEY
    , pageid             INTEGER    -- REFERENCES page
    , ztext               VARCHAR    NOT NULL
    , position           INTEGER    NOT NULL
    );

-- Tokens in the text blocks
DROP TABLE tmp.token CASCADE;
DROP SEQUENCE tmp.tokenidsqc;
CREATE SEQUENCE tmp.tokenidsqc;

CREATE TABLE tmp.token
    ( id                 INTEGER    PRIMARY KEY DEFAULT nextval('tmp.tokenidsqc')
    , ztext               VARCHAR    NOT NULL
    , chartype           VARCHAR    NOT NULL
    , grammar            VARCHAR    NOT NULL
    , category           VARCHAR  NOT NULL
    , stopword           BOOLEAN    NOT NULL    DEFAULT false
    , UNIQUE(ztext)
    );

-- N:N relationship between textblock and token 
DROP TABLE tmp.textblockhastoken CASCADE;
CREATE TABLE tmp.textblockhastoken
    ( textblockid        INTEGER    NOT NULL REFERENCES tmp.textblock(id)
    , tokenid            INTEGER    NOT NULL REFERENCES tmp.token(id)
    , sentence           INTEGER    NOT NULL
    , position           INTEGER    NOT NULL
    );

-- Insert the token
INSERT INTO tmp.token(ztext,chartype,grammar,category)
VALUES('foo' , 'T' , 'a', 'a' ), ('bar' , 'T' , 'a', 'a' );

SELECT * FROM tmp.token;

Upvotes: 3

A.H.

Reputation: 66263

duffymo has already mentioned batch updates. Just to be sure, I would add:

  • Have you experimented / measured with different transaction sizes?
  • Do you already use and reuse prepared statements?
  • Have you experimented / measured with "INSERT INTO token ... RETURNING id" and feeding the returned id directly into the second insert statement? (A sketch follows below.)

But without knowing anything about the existing code this is all very much crystal ball guessing.
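
For illustration, here is a minimal JDBC sketch of the prepared-statement / RETURNING idea. It assumes an open java.sql.Connection and the table, column, and sequence names from the question; the helper method itself is hypothetical, and it only covers tokens that are genuinely new (an existing token would still need a lookup because of the UNIQUE(text) constraint).

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper: insert one token and its relationship row.
// RETURNING hands back the generated token id, so the second INSERT
// no longer needs the nested SELECT from the question.
static void insertTokenWithRelation(Connection conn, String text, String charType,
                                    String grammar, String[] category,
                                    int textBlockId, int sentence, int position)
        throws SQLException {
    String insertToken =
        "INSERT INTO token(id, text, charType, grammar, category) " +
        "VALUES (nextval('tokenIdSqc'), ?, ?, ?, ?) RETURNING id";
    String insertRelation =
        "INSERT INTO textBlockHasToken(textBlockId, tokenId, sentence, position) " +
        "VALUES (?, ?, ?, ?)";

    try (PreparedStatement psToken = conn.prepareStatement(insertToken);
         PreparedStatement psRel = conn.prepareStatement(insertRelation)) {

        psToken.setString(1, text);
        psToken.setString(2, charType);
        psToken.setString(3, grammar);
        psToken.setArray(4, conn.createArrayOf("varchar", category));

        int tokenId;
        try (ResultSet rs = psToken.executeQuery()) {  // RETURNING comes back as a result set
            rs.next();
            tokenId = rs.getInt(1);
        }

        psRel.setInt(1, textBlockId);
        psRel.setInt(2, tokenId);   // feed the returned id straight into the join table
        psRel.setInt(3, sentence);
        psRel.setInt(4, position);
        psRel.executeUpdate();
    }
}

In a real application the two statements would be prepared once per connection and reused across calls rather than re-prepared for every token, which is the point of the second bullet above.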

Upvotes: 1

zie

Reputation: 730

I totally agree with duffymo, but I would also add that Postgres's COPY feature is the fastest way to import data.

Of course, it's not always feasible to do so, especially from within code, which is why I also agree with duffymo: if you NEED to do it from code on a separate machine, do what duffymo said.
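
If the rows can be staged as text, here is a minimal sketch of driving COPY from Java through the PostgreSQL JDBC driver's CopyManager. The connection URL, credentials, and the literal sample rows are placeholders; the target table is the one from the question.

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Sketch: bulk-load relationship rows with COPY ... FROM STDIN instead of
// row-by-row INSERTs. Connection details and the sample rows are placeholders.
public class CopyLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password")) {

            // Tab-separated rows: textBlockId, tokenId, sentence, position.
            // In practice these would be streamed from a file or a generated buffer.
            String rows = "1\t42\t0\t0\n"
                        + "1\t43\t0\t1\n";

            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long loaded = copy.copyIn(
                "COPY textBlockHasToken(textBlockId, tokenId, sentence, position) FROM STDIN",
                new StringReader(rows));
            System.out.println("Rows copied: " + loaded);
        }
    }
}

Note that COPY still needs the token ids resolved beforehand, so it is most useful for bulk-loading the textBlockHasToken rows once those ids are known.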

Upvotes: 1

duffymo

Reputation: 308958

The best thing you can do is make sure that every column used in a WHERE clause has an index. Indexes are created automatically for primary keys; make sure foreign key columns are indexed as well.
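
For instance, the two foreign key columns of textBlockHasToken do not get indexes automatically. A minimal sketch of adding them from Java; the index names are made up for illustration.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: index the foreign key columns of the join table; only primary keys
// (and the UNIQUE(text) constraint on token) get indexes automatically.
static void createJoinTableIndexes(Connection conn) throws SQLException {
    try (Statement st = conn.createStatement()) {
        st.execute("CREATE INDEX textBlockHasToken_textBlockId_idx ON textBlockHasToken(textBlockId)");
        st.execute("CREATE INDEX textBlockHasToken_tokenId_idx ON textBlockHasToken(tokenId)");
    }
}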

The other consideration is batching requests. Don't do a network roundtrip for each INSERT; batch them together in chunks and commit each chunk. It'll mean fewer network round trips and manageable transaction logs as well.
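
A minimal sketch of that batching pattern with JDBC, using the relationship insert from the question; the chunk size and the int[] row holder are arbitrary choices for illustration.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch: send the relationship inserts in batches and commit per chunk,
// instead of one round trip and one transaction per row.
static void insertRelations(Connection conn, List<int[]> rows) throws SQLException {
    String sql = "INSERT INTO textBlockHasToken(textBlockId, tokenId, sentence, position) "
               + "VALUES (?, ?, ?, ?)";
    final int chunkSize = 1000;               // tune by measuring

    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        int count = 0;
        for (int[] r : rows) {                // r = {textBlockId, tokenId, sentence, position}
            ps.setInt(1, r[0]);
            ps.setInt(2, r[1]);
            ps.setInt(3, r[2]);
            ps.setInt(4, r[3]);
            ps.addBatch();
            if (++count % chunkSize == 0) {
                ps.executeBatch();            // one round trip for the whole chunk
                conn.commit();                // keeps each transaction a manageable size
            }
        }
        ps.executeBatch();                    // flush the final partial chunk
        conn.commit();
    }
}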

Upvotes: 5
