user45867

Reputation: 983

Improving Speed of SQL 'Update' function - break into Insert/Delete?

I'm running an ETL process and streaming data into a MySQL table.

Right now the data is being written over a web connection (a fairly fast one), so that can be a bottleneck.

Anyway, it's a basic insert/update process: a list of IDs as the primary key/index, plus a few attributes.

If a new ID is found, insert; otherwise, update ... you get the idea.

Currently, doing an "update, else insert" routine based on the (indexed) ID processes about 13 rows/second, which seems pretty abysmal, right? For context, this is comparing 1000 incoming rows against a table of 250k records.

A pure "insert everything" approach, for comparison, already speeds things up to 26 rows/second.

The thing with the pure "insert" approach is that I can have 20 parallel connections inserting at once (20 is the max allowed by my web host), whereas the "update" routine cannot run in parallel at all.

Thus 26 x 20 = 520 rows/second, far better than 13 rows/second, especially if I can rig something up that pushes even more data through in parallel.

My question: given the massive benefit of inserting vs. updating, is there a way to replicate the "update" functionality (I only want the most recent insert of a given ID to survive) by doing a massive insert and then running a delete afterwards that removes every duplicate ID except the newest?
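For illustration, the cleanup I'm imagining would be something like this rough sketch. It assumes a hypothetical AUTO_INCREMENT column seq to order the inserts, that id is demoted from primary key to a plain non-unique index (so duplicates can be inserted at all), and a placeholder table name:

    -- hypothetical cleanup: keep only the newest row per id
    -- (seq is an assumed AUTO_INCREMENT column; id is a non-unique index here)
    DELETE older
    FROM etl_table AS older
    JOIN etl_table AS newer
      ON newer.id = older.id
     AND newer.seq > older.seq;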

Is this something easy to implement, or something that comes up often?

What else can I do to make this update process faster? I know getting rid of the web connection between the ETL tool and the DB is a start, but what else? This seems like it would be a fairly common problem.

Ultimately there are 20 columns, at most around VARCHAR(50) each. Should I be getting a lot more than 13 rows processed per second?

Upvotes: 0

Views: 267

Answers (2)

Ron Dunn

Reputation: 3078

Do it in the DBMS, and wrap it in a transaction.

To explain:

  1. Load your data into a temporary table in MySQL in the fastest way possible. Bulk load, insert, do whatever works. Look at LOAD DATA INFILE.

  2. Outer-join the temporary table to the target table, and INSERT those rows where the PK column of the target table is NULL.

  3. Outer-join the temporary table to the target table, and UPDATE those rows where the PK column of the target table is NOT NULL.

Wrap steps 2 and 3 in a BEGIN/COMMIT (or START TRANSACTION/COMMIT) pair. The default behaviour is probably autocommit, which means you're doing a LOT of database work after every insert/update. Use transactions properly, and the work is only done once per block.
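A minimal sketch of those three steps (table and column names are placeholders, not from the question):

    -- step 1: bulk-load the incoming batch into a staging table
    LOAD DATA INFILE '/tmp/batch.csv'
    INTO TABLE staging
    FIELDS TERMINATED BY ','
    (id, attr1, attr2);

    START TRANSACTION;

    -- step 2: insert rows whose id does not yet exist in the target
    INSERT INTO target (id, attr1, attr2)
    SELECT s.id, s.attr1, s.attr2
    FROM staging AS s
    LEFT JOIN target AS t ON t.id = s.id
    WHERE t.id IS NULL;

    -- step 3: update rows whose id already exists
    UPDATE target AS t
    JOIN staging AS s ON s.id = t.id
    SET t.attr1 = s.attr1,
        t.attr2 = s.attr2;

    COMMIT;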

Upvotes: 1

Rick James

Reputation: 142453

There are many possible 'answers' to your questions.

13/second -- there's a lot that can be done about that...

INSERT ... ON DUPLICATE KEY UPDATE ... ('IODKU') is usually the best way to do "update, else insert" (unless I've misunderstood what you mean by it).
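For example (illustrative table and column names), one statement replaces the two-step logic:

    -- single round trip: inserts id 42 if new, otherwise updates that row
    INSERT INTO target (id, attr1, attr2)
    VALUES (42, 'foo', 'bar')
    ON DUPLICATE KEY UPDATE
        attr1 = 'foo',
        attr2 = 'bar';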

Batched inserts are much faster than inserting one row at a time; the optimum is around 100 rows per batch, giving roughly a 10x speedup. IODKU can (usually) be batched, too; see the VALUES() pseudo-function.
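Batched, the same statement might look like this (names illustrative again); VALUES(col) refers to the value that row would have inserted:

    INSERT INTO target (id, attr1, attr2) VALUES
        (1, 'a', 'x'),
        (2, 'b', 'y'),
        (3, 'c', 'z')   -- ... around 100 rows per batch is near optimal
    ON DUPLICATE KEY UPDATE
        attr1 = VALUES(attr1),
        attr2 = VALUES(attr2);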

BEGIN; ...lots of writes... COMMIT; cuts back significantly on transaction overhead.

Using a "staging" table for gathering things up update can have a significant benefit. My blog discussing that. That also covers batch "normalization".

Building Summary Tables on the fly interferes with high-speed data ingestion. Another blog covers Summary Tables.

Normalization can be used for de-duping, hence shrinking the disk footprint. This can be important for decreasing I/O for the 'Fact' table in Data Warehousing. (I am referring to your 20 x VARCHAR(50).)
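As a hypothetical sketch of that idea (your column names will differ), one repeated VARCHAR(50) column could be replaced by a small integer pointing at a lookup table:

    -- hypothetical dimension table for one repeated VARCHAR(50) column
    CREATE TABLE attr1_dim (
        attr1_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        attr1    VARCHAR(50) NOT NULL,
        UNIQUE KEY (attr1)
    );

    -- add any values not already present, in one batch
    INSERT IGNORE INTO attr1_dim (attr1)
    SELECT DISTINCT attr1 FROM staging;

    -- the Fact table then stores a 4-byte attr1_id instead of up to ~50 bytes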

RAID striping is a hardware help.

Battery-backed write cache on a RAID controller makes writes seem instantaneous.

SSDs speed up I/O.

If you provide some more specifics (SHOW CREATE TABLE, SQL, etc), I can be more specific.

Upvotes: 1
