mpe

Reputation: 1000

How can I improve LOAD DATA performance on large InnoDB tables?

I have a table with more than 7 million rows, and I am LOAD DATA LOCAL INFILE'ing more data into it in chunks of about 0.5 million rows. The first few loads were fast, but each addition is taking increasingly long, probably due to indexing overhead:

CREATE TABLE `orthograph_ests` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `digest` char(32) NOT NULL,
  `taxid` int(10) unsigned NOT NULL,
  `date` int(10) unsigned DEFAULT NULL,
  `header` varchar(255) NOT NULL,
  `sequence` mediumblob,
  PRIMARY KEY (`id`),
  UNIQUE KEY `digest` (`digest`),
  KEY `taxid` (`taxid`),
  KEY `header` (`header`)
) ENGINE=InnoDB AUTO_INCREMENT=12134266 DEFAULT CHARSET=latin1

I am developing an application that will run on pre-existing databases. I most likely have no control over server variables unless I make changes to them mandatory (which I would prefer not to), so I'm afraid suggestions like these are of limited use.

I have read that minimizing keys on this table will help. However, I need those keys for later queries. I'm guessing that dropping and re-creating them would take very long as well, but I have not tested this. I have also read that the UNIQUE constraint in particular makes the insertion slow. The digest column will take SHA256 digests, which must be unique, and I can't make sure there is no collision (very unlikely, I know, but possible).
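For reference, what I would be testing is something along these lines (just a sketch, I have not run it):

-- drop the secondary keys before the load...
ALTER TABLE orthograph_ests
  DROP KEY `digest`,
  DROP KEY `taxid`,
  DROP KEY `header`;

-- ...and re-create them after all chunks are loaded
-- (re-adding the UNIQUE key would fail if a collision slipped in meanwhile)
ALTER TABLE orthograph_ests
  ADD UNIQUE KEY `digest` (`digest`),
  ADD KEY `taxid` (`taxid`),
  ADD KEY `header` (`header`);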

Would partitioning help, as suggested here? Could I improve the indexing, e.g., by limiting the key length on the digest column? Should I change to MyISAM, which supports DISABLE KEYS during transactions? What else could I do to improve LOAD DATA performance?

Edit:

After the large insertion, this table is used for SELECTs only, no more writes. This large loading is mostly a once-and-done operation; however, about 1,000 datasets (of 0.5M rows each) need to be uploaded before it is finished.

I will be using the digest to look up rows, which is why I indexed that column. If there should be a collision, that individual row should not be uploaded.
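In case it helps, skipping a colliding row (rather than failing the whole load) could presumably be done with LOAD DATA's IGNORE modifier; the file name and column list here are just placeholders:

-- IGNORE demotes duplicate-key errors to warnings, so a colliding row is skipped
-- (with LOCAL this is the default behaviour anyway)
LOAD DATA LOCAL INFILE '/tmp/ests_chunk.tsv'
IGNORE INTO TABLE orthograph_ests
(`digest`, `taxid`, `date`, `header`, `sequence`);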

Putting the sequence blob in an external file system is probably not a viable option since I cannot easily impose file system changes on the users.

Upvotes: 1

Views: 6780

Answers (2)

jeremycole

Reputation: 2761

There are essentially two reasons for data loading to be slow:

  1. Write performance in inserting the data itself.
  2. Read performance when reading existing data in order to modify existing pages while adding new data.

The write performance issues can be addressed primarily by reducing durability and reducing logging. That is what many of the suggestions you'll find amount to, such as the following settings: innodb_flush_log_at_trx_commit=0, innodb_doublewrite=0, innodb_support_xa=0. Reducing the amount of data written can be helpful as well, such as by setting log-bin=0. It is not really acceptable to expect your customers to change these durability-related settings on existing production systems, though. Changing them is more suited to one-shot bulk loading on dedicated systems than to periodically adding data to existing systems.
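For completeness, a rough my.cnf sketch of the settings mentioned above, for a dedicated one-shot load only (again, not something to push onto customers' existing production systems):

[mysqld]
# reduced durability -- suitable only for a dedicated, one-shot bulk load
innodb_flush_log_at_trx_commit = 0
innodb_doublewrite = 0
innodb_support_xa = 0
# leave log-bin out entirely so binary logging stays off and less data is written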

A bulk-loading INSERT (with many rows) or LOAD DATA INFILE seeks to reduce write throughput requirements by writing more data in a single transaction, thus reducing the number of syncs to the transaction logs. Reducing write throughput requirements or increasing write performance can only help to an extent, though.
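As a trivial illustration (the values here are made up):

-- with autocommit, every single-row INSERT is its own transaction and its own log sync
INSERT INTO orthograph_ests (`digest`, `taxid`, `header`, `sequence`)
VALUES ('0123456789abcdef0123456789abcdef', 42, 'example header 1', 'ACGT');

-- a multi-row INSERT (or one LOAD DATA INFILE for the whole chunk) commits once
INSERT INTO orthograph_ests (`digest`, `taxid`, `header`, `sequence`) VALUES
  ('00000000000000000000000000000002', 42, 'example header 2', 'ACGT'),
  ('00000000000000000000000000000003', 42, 'example header 3', 'ACGT');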

It can also generally be helpful to sort the data by PRIMARY KEY before loading, in order to reduce the amount of unnecessary page splits when inserting data into the index structures. However that is of limited use when there are multiple secondary keys, since sorting by PRIMARY KEY will necessarily mean that the data is not sorted by at least one secondary key.

The read performance issues are perhaps more interesting, and are frequently the actual performance problem with loading new data into existing tables, especially where secondary keys are present. The best possible thing would be that all of your existing data fits into memory (innodb_buffer_pool_size is large enough) such that paging data in and out of the cache is not necessary during loading. Given you're talking about only a few million rows, this may be feasible.
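A quick way to sanity-check that (sizes are approximate, and whether you can actually raise the buffer pool on a customer's server is a separate question):

-- approximate size of the table plus its indexes
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS approx_total_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'orthograph_ests';

-- compare against the configured buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';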

Upvotes: 1

O. Jones

Reputation: 108641

This is indeed a large amount of data you are loading, and you should expect it to take many dozens of hours, especially on general purpose shared server hardware. There's very little magic (unless you work at Google or something) that will make this job anything but a big pain in the neck. So have courage.

It's a reference table. That means you should immediately switch to MyISAM and stay there for this table. You don't need InnoDB's transactional integrity features, but you do need MyISAM to disable indexing during loading and re-enable it afterward. Re-enabling indexing will take a long time, so be prepared for that.
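Sketched out, the workflow would look roughly like this (the file name is a placeholder; try it on a copy of the table first):

ALTER TABLE orthograph_ests ENGINE=MyISAM;

-- skip maintenance of the non-unique indexes while loading
-- (note: DISABLE KEYS does not affect the UNIQUE digest index)
ALTER TABLE orthograph_ests DISABLE KEYS;

LOAD DATA LOCAL INFILE '/tmp/ests_chunk.tsv'
INTO TABLE orthograph_ests
(`digest`, `taxid`, `date`, `header`, `sequence`);

-- rebuild the non-unique indexes in one pass; this is the slow part
ALTER TABLE orthograph_ests ENABLE KEYS;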

You should consider using a shorter hash than SHA-256. SHA-1 (160 bits) is good. Believe it or not, MD-5 (128 bits) may also serve. MD-5 has been cracked so it's not suitable for secure content authentication. But it's still a useful hash. A shorter hash is a better hash from your perspective.
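For reference, these are the hex strings MySQL's built-in functions return, and therefore the column widths each hash needs (the input string is just a placeholder):

SELECT SHA2('placeholder sequence', 256) AS sha256_hex,  -- 64 hex characters
       SHA1('placeholder sequence')      AS sha1_hex,    -- 40 hex characters
       MD5('placeholder sequence')       AS md5_hex;     -- 32 hex characters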

If you can disable indexing MyISAM style, it probably doesn't matter much whether your digest key is unique. But you might consider allowing it to be non-unique to save time.

It's hard to make a suggestion about partitioning without knowing more about your data and your server hardware. But considering this is a reference database, it seems like it might be wise just to bite the bullet for a couple of weeks and get it loaded.

If you have plenty of server disk space, you might consider loading each half-megarow chunk into its own table, then inserting it into the big table. That might prove to be a good way to deal with the possibility that you might have to reload the whole thing some day.
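A hypothetical sketch of that (table and file names are made up; INSERT IGNORE quietly drops any row whose digest already exists, which matches what you said you want on collision):

-- one staging table per chunk, kept around in case the big table ever has to be rebuilt
CREATE TABLE ests_chunk_0001 LIKE orthograph_ests;

LOAD DATA LOCAL INFILE '/tmp/ests_chunk_0001.tsv'
INTO TABLE ests_chunk_0001
(`digest`, `taxid`, `date`, `header`, `sequence`);

INSERT IGNORE INTO orthograph_ests (`digest`, `taxid`, `date`, `header`, `sequence`)
SELECT `digest`, `taxid`, `date`, `header`, `sequence`
FROM ests_chunk_0001;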

On shared server hardware, it might make sense to use smaller chunks than half a megarow.

You might consider making a separate id / digest table. Then you could load your data without the digests and get it done quickly. Then you could write yourself a stored procedure or client that would create the digests in batches of a few thousand rows each until they were done. This only works if the stuff being digested is in your dataset.
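A sketch of what that might look like, assuming the digest is computed from the sequence column (which, as noted, may not be the case) and using SHA-1 per the earlier suggestion:

-- side table holding only id and digest
CREATE TABLE orthograph_digests (
  `id`     bigint(20) unsigned NOT NULL,
  `digest` char(40) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `digest` (`digest`)
) ENGINE=InnoDB;

-- one batch of a few thousand rows; rerun until it inserts zero rows
-- (IGNORE skips any row whose digest collides with one already stored)
INSERT IGNORE INTO orthograph_digests (id, digest)
SELECT e.id, SHA1(e.sequence)
FROM orthograph_ests e
LEFT JOIN orthograph_digests d ON d.id = e.id
WHERE d.id IS NULL
LIMIT 5000;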

Upvotes: 2
