Bob Muller

Reputation: 55

MySQL inserts very slow into MEMORY table

I am trying to optimize a large (68K row) insert into a staging table. I created the table as a MEMORY engine table, and there are no indexes or foreign keys at all. When my ETL process starts inserting, the inserts execute, but very slowly; the full load takes over an hour.

Here's the table definition from SHOW CREATE TABLE:

CREATE TABLE `pub_tair_germplasm` (
  `germplasm_id` int(12) DEFAULT NULL,
  `name` varchar(100) DEFAULT NULL,
  `original_name` varchar(100) DEFAULT NULL,
  `sp_growth_conditions` varchar(2048) DEFAULT NULL,
  `description` varchar(2048) DEFAULT NULL,
  `description_uc` varchar(2048) DEFAULT NULL,
  `is_mutant` varchar(1) DEFAULT NULL,
  `is_aneuploid` varchar(1) DEFAULT NULL,
  `ploidy` varchar(4) DEFAULT NULL,
  `species_variant_id` int(12) DEFAULT NULL,
  `taxon_id` int(12) DEFAULT NULL,
  `aneuploid_chromosome` int(10) DEFAULT NULL,
  `date_entered` date DEFAULT NULL,
  `date_last_modified` date DEFAULT NULL,
  `tair_object_id` bigint(19) DEFAULT NULL,
  `is_obsolete` varchar(1) DEFAULT NULL,
  `tair_object_type_id` int(12) DEFAULT NULL,
  `germplasm_type` varchar(20) DEFAULT NULL
) ENGINE=MEMORY DEFAULT CHARSET=latin1

Here's the insert:

INSERT INTO pub_tair_germplasm(
   germplasm_id,
   name,
   original_name,
   sp_growth_conditions,
   description,
   description_uc,
   is_mutant,
   is_aneuploid,
   ploidy,
   species_variant_id,
   taxon_id,
   aneuploid_chromosome,
   date_entered,
   date_last_modified,
   tair_object_id,
   is_obsolete,
   tair_object_type_id,
   germplasm_type)
VALUES (
   $germplasm_id,
   $name,
   $original_name,
   $sp_growth_conditions,
   $description,
   $description_uc,
   CASE $is_mutant WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   CASE $is_aneuploid WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   $ploidy,
   $species_variant_id, 
   $taxon_id,
   $aneuploid_chromosome,
   $date_entered,
   $date_last_modified,
   $tair_object_id,
   $is_obsolete,
   $tair_object_type_id,
   $type)

This is being done through Clover/ETL, which normally inserts very quickly, using JDBC batching with a batch size of 5000. The value variables are CloverETL variable references. Similar inserts into a regular table on Oracle take seconds. This is all done in a single transaction, with no commits until all rows are inserted (an application requirement).
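One detail worth checking on the JDBC side (an assumption about the connection settings, which aren't shown here): by default, MySQL Connector/J sends a JDBC batch as individual single-row INSERT statements, one round trip each; it only rewrites a batch into a single multi-row INSERT when rewriteBatchedStatements=true is set on the JDBC URL. The rewritten form would look roughly like this sketch (only a few columns shown; rows after the first are made up):

-- sketch of what a rewritten batch sends on the wire: one statement,
-- many VALUES tuples, instead of 5000 separate single-row INSERTs
INSERT INTO pub_tair_germplasm (germplasm_id, name, original_name, germplasm_type)
VALUES
  (500689369, 'CS2000002', 'CS2000002', 'individual_line'),
  (500689370, 'CS2000003', 'CS2000003', 'individual_line'),  -- hypothetical row
  (500689371, 'CS2000004', 'CS2000004', 'individual_line');  -- hypothetical row

Without the rewrite, a batch of 5000 behaves much like 5000 separate statements.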

While the inserts are running, top shows that both CPUs are 0.3% utilized.

EDITS:

For the next test run, I increased max_heap_table_size to roughly 1GB, easily enough to hold the entire table:

mysql> select @@max_heap_table_size;
+-----------------------+
| @@max_heap_table_size |
+-----------------------+
|             999999488 |
+-----------------------+
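(For reference, this was presumably set with something along these lines; a sketch, not the exact command used. The server rounds the value down to a multiple of 1024, which is why 1000000000 shows up as 999999488:)

SET GLOBAL max_heap_table_size = 1000000000;
-- applies only to MEMORY tables created (or truncated/altered) afterwards;
-- a session may also need SET SESSION max_heap_table_size before CREATE TABLE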

Process list at start:

mysql> SHOW FULL PROCESSLIST;
+----+------+-----------+-------+---------+------+-------+-----------------------+
| Id | User | Host      | db    | Command | Time | State | Info                  |
+----+------+-----------+-------+---------+------+-------+-----------------------+
|  3 | root | localhost | mysql | Query   |    0 | NULL  | SHOW FULL PROCESSLIST |
+----+------+-----------+-------+---------+------+-------+-----------------------+
1 row in set (0.00 sec)

Process list during run:

mysql> SHOW FULL PROCESSLIST;
+----+---------+--------------------------------------------+-------+---------+------+-------+-----------------------+
| Id | User    | Host                                       | db    | Command | Time | State | Info                  |
+----+---------+--------------------------------------------+-------+---------+------+-------+-----------------------+
|  4 | pubuser | c-67-188-135-136.hsd1.ca.comcast.net:55928 | pub   | Sleep   |    0 |       | NULL                  |
|  5 | root    | localhost                                  | mysql | Query   |    0 | NULL  | SHOW FULL PROCESSLIST |
+----+---------+--------------------------------------------+-------+---------+------+-------+-----------------------+
2 rows in set (0.00 sec)

I enabled the general log; it shows the CloverETL environment setup commands being issued, then the series of inserts:

150528 20:22:54     4 Connect   [email protected] on pub
                    4 Query     /* mysql-connector-java-5.1.20 ( Revision: [email protected] ) */SHOW VARIABLES WHERE Variable_name ='language' OR Variable_name = 'net_write_timeout' OR Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR Variable_name = 'character_set_client' OR Variable_name = 'character_set_connection' OR Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isolation' OR Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR Variable_name = 'lower_case_table_names' OR Variable_name = 'max_allowed_packet' OR Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR Variable_name = 'query_cache_type' OR Variable_name = 'query_cache_size' OR Variable_name = 'init_connect'
                    4 Query     /* mysql-connector-java-5.1.20 ( Revision: [email protected] ) */SELECT @@session.auto_increment_increment
                    4 Query     SHOW COLLATION
150528 20:22:55     4 Query     SET NAMES latin1
                    4 Query     SET character_set_results = NULL
                    4 Query     SET autocommit=1
                    4 Query     SET sql_mode='STRICT_TRANS_TABLES'
                    4 Query     SET autocommit=0
                    4 Query     SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
150528 20:23:08     4 Query     INSERT INTO pub_tair_germplasm(
   germplasm_id,
   name,
   original_name,
   sp_growth_conditions,
   description,
   description_uc,
   is_mutant,
   is_aneuploid,
   ploidy,
   species_variant_id,
   taxon_id,
   aneuploid_chromosome,
   date_entered,
   date_last_modified,
   tair_object_id,
   is_obsolete,
   tair_object_type_id,
   germplasm_type)
VALUES (
   500689369,
   'CS2000002',
   'CS2000002',
   'none',
   'Sequence-indexed T-DNA insertion line; from the GABI-Kat project (German Plant Genomics Program - Koelner Arabidopsis T-DNA lines); generated using flanking sequence tags (FSTs) in the Columbia (Col-0) background; genomic DNA was isolated from T1 plants; plant sequences adjacent to T-DNA borders were amplified by adapter-ligation PCR; automated purification and sequencing of PCR product were conducted followed by computational trimming of the resulting sequence files; for details, see the GABI-Kat web site: http://www.gabi-kat.de; this is a T4 generation single-plant line potentially homozygous for the insertion. May be segregating for phenotypes that are not linked to the insertion; may have additional insertions potentially segregating.',
   'SEQUENCE-INDEXED T-DNA INSERTION LINE; FROM THE GABI-KAT PROJECT (GERMAN PLANT GENOMICS PROGRAM - KOELNER ARABIDOPSIS T-DNA LINES); GENERATED USING FLANKING SEQUENCE TAGS (FSTS) IN THE COLUMBIA (COL-0) BACKGROUND; GENOMIC DNA WAS ISOLATED FROM T1 PLANTS; PLANT SEQUENCES ADJACENT TO T-DNA BORDERS WERE AMPLIFIED BY ADAPTER-LIGATION PCR; AUTOMATED PURIFICATION AND SEQUENCING OF PCR PRODUCT WERE CONDUCTED FOLLOWED BY COMPUTATIONAL TRIMMING OF THE RESULTING SEQUENCE FILES; FOR DETAILS, SEE THE GABI-KAT WEB SITE: HTTP://WWW.GABI-KAT.DE; THIS IS A T4 GENERATION SINGLE-PLANT LINE POTENTIALLY HOMOZYGOUS FOR THE INSERTION. MAY BE SEGREGATING FOR PHENOTYPES THAT ARE NOT LINKED TO THE INSERTION; MAY HAVE ADDITIONAL INSERTIONS POTENTIALLY SEGREGATING.',
   CASE null WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   CASE 'F' WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   '2',
   null, 
   1,
   null,
   '2015-01-06 10:49:21',
   '2015-01-06 10:40:55',
   6530679980,
   'F',
   200016,
   'individual_line')

The problem remains unchanged.

Upvotes: 1

Views: 2060

Answers (3)

JanS

Reputation: 31

Broadly speaking, a batch size of 5000 statements is considered too big. The optimal JDBC batch size depends on multiple criteria, but values between 50 and 100 should be fine. You can of course try bigger numbers and search for the value that works best for you. I would definitely experiment a little with the number and check how performance changes.

Regarding the memory requirements (which is also one of the factors that comes into play when choosing a batch size): based on the CREATE statement, the maximum size of a row is 6418 bytes. So with approximately 68K of these rows (counting the maximum size for each row), the maximum memory required for this task is around 450MB. That is not even half of the max_heap_table_size you have defined, so there shouldn't be a problem if the table started out empty. Moreover, if you reached the limit of the MEMORY table, the CloverETL graph would fail with an appropriate SQLException (java.sql.SQLException: The table '<YOUR_MEMORY_TABLE>' is full). That doesn't seem to be the case, though.
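If you want to sanity-check the table's actual footprint while the load runs, something like this sketch should work ('pub' is the schema name taken from the process list above):

SELECT table_rows,
       avg_row_length,
       ROUND(data_length / 1024 / 1024) AS data_mb  -- current size in MB
  FROM information_schema.TABLES
 WHERE table_schema = 'pub'
   AND table_name = 'pub_tair_germplasm';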

Another option that comes to mind (again related to the notes above) is that you don't actually have enough available memory to hold the data, so it is being swapped to disk. That might explain the performance slowdown. However, if you ran out of all available virtual memory (RAM and swap), you would most likely run into a different issue with different symptoms, so that probably isn't the case either.

If nothing above helps, maybe you could share the log from the CloverETL graph execution (run at DEBUG log level if possible) so that we can see whether there is anything suspicious. Apart from that, the commercial version of CloverETL also comes with MySQLDataWriter, which uses the MySQL native client and might avoid JDBC-related issues. I would start with finding the right batch size, though.

Upvotes: 0

Bob Muller

Reputation: 55

Well, I don't know what the problem was specifically, but uploading a tab-delimited version of the data to the MySQL server and doing this:

LOAD DATA LOCAL INFILE '/tmp/pub_tair_grm_insert.csv' INTO TABLE pub_tair_germplasm;
Query OK, 68932 rows affected, 65535 warnings (1.26 sec)
Records: 68932  Deleted: 0  Skipped: 0  Warnings: 6

is clearly the answer, whatever the question. There must be something in the JDBC batching being done by Clover/ETL that is slowing inserts dramatically. I'll look into that if I get a chance, but for now the LOAD gives me what I need.
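One note for anyone copying this workaround: LOAD DATA LOCAL only works when the local_infile capability is enabled on the server (and allowed by the client); a quick check, as a sketch:

SHOW GLOBAL VARIABLES LIKE 'local_infile';
SET GLOBAL local_infile = 1;  -- needs a privileged account (SUPER)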

Upvotes: 1

spencer7593

Reputation: 108430

I suspect that you are attempting to exceed the maximum size allowed for a table using the MEMORY storage engine. But I don't understand why MySQL isn't returning an error, or why Clover/ETL isn't reporting an error or timing out.

I recommend gathering output from

SHOW FULL PROCESSLIST

to see the state of the current sessions. Maybe an INSERT statement is "hanging"?

You could also temporarily enable the general log:

SET GLOBAL general_log = 1

and then try the load. The log file can grow really big really fast, so be sure to set the variable back to 0 afterwards to disable it.
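If grepping the log file is awkward, the general log can also be written to a table instead, which makes it easy to query; a sketch:

SET GLOBAL log_output = 'TABLE';    -- log to mysql.general_log instead of a file
SET GLOBAL general_log = 1;
-- reproduce the load, then inspect the most recent statements:
SELECT event_time, command_type, LEFT(argument, 80)
  FROM mysql.general_log
 ORDER BY event_time DESC
 LIMIT 20;
SET GLOBAL general_log = 0;         -- turn it back off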

As I noted in my comment, I calculate a size of over 415MB for the table. That's based on the MEMORY engine's fixed-length rows and a row size of 6407 bytes (assuming a single-byte character set for the character columns), not including per-row overhead for null indicators.

Excerpt from the MySQL Reference Manual, section 15.3, The MEMORY Storage Engine:

The maximum size of MEMORY tables is limited by the max_heap_table_size system variable, which has a default value of 16MB. To enforce different size limits for MEMORY tables, change the value of this variable. The value in effect for CREATE TABLE, or a subsequent ALTER TABLE or TRUNCATE TABLE, is the value used for the life of the table. A server restart also sets the maximum size of existing MEMORY tables to the global max_heap_table_size value. You can set the size for individual tables as described later in this section.

Reference: https://dev.mysql.com/doc/refman/5.6/en/memory-storage-engine.html
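The per-table sizing the manual mentions works by setting the session variable just before creating (or rebuilding) the table; roughly, as a sketch:

SET SESSION max_heap_table_size = 1024 * 1024 * 1024;  -- ~1GB, this session only
-- a CREATE TABLE ... ENGINE=MEMORY issued now captures that limit;
-- for an existing table, ALTER TABLE pub_tair_germplasm ENGINE=MEMORY
-- rebuilds it and re-snapshots the limit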


Also, we note that the MEMORY storage engine does not support transactions.

If you have a requirement that no other sessions see the data while it's being loaded, you could consider using InnoDB or MyISAM, and loading to a different table name, and then renaming the table as soon as the load is completed.
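A sketch of that pattern (the _new/_old names are made up); RENAME TABLE swaps atomically, so other sessions never see a half-loaded table:

CREATE TABLE pub_tair_germplasm_new LIKE pub_tair_germplasm;
ALTER TABLE pub_tair_germplasm_new ENGINE=InnoDB;
-- ... run the full load into pub_tair_germplasm_new ...
RENAME TABLE pub_tair_germplasm TO pub_tair_germplasm_old,
             pub_tair_germplasm_new TO pub_tair_germplasm;
DROP TABLE pub_tair_germplasm_old;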

Or, use the InnoDB storage engine and a hughjass transaction, doing all the INSERT operations as part of a single transaction, as sketched below. (But I'm not a fan of that approach; I'm afraid of generating a whole bloatload of rollback.)
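In outline (a sketch):

SET autocommit = 0;
-- ... all 68K INSERT statements ...
COMMIT;  -- rows become visible to other sessions only here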

Or, use the MyISAM storage engine: take a LOCK on the table, do all the INSERT operations, and then release the lock, as sketched below. (But any other statement attempting to reference the table would "hang" waiting for the lock to be released.)
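Again in outline (a sketch):

LOCK TABLES pub_tair_germplasm WRITE;  -- blocks all other access to the table
-- ... all 68K INSERT statements ...
UNLOCK TABLES;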

Upvotes: 0
