mimo31

Reputation: 355

Loading Lots of Data into MySQL Is Too Slow

I'm running MySQL on my local computer. I've set up a database with multiple tables, three of which are relevant here:

CREATE TABLE rnames (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname varchar(16) UNIQUE
);
CREATE TABLE fetch_times (
    id INT AUTO_INCREMENT PRIMARY KEY,
    timestamp BIGINT NOT NULL
);
CREATE TABLE measurements (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname_id INT,
    coorx real,
    coory real,
    timestamp BIGINT,
    delay integer,
    fetch_id integer,
    UNIQUE (rname_id, coorx, coory, timestamp),
    FOREIGN KEY (rname_id) REFERENCES rnames(id),
    FOREIGN KEY (fetch_id) REFERENCES fetch_times(id)
);

rnames and fetch_times have 397 and 2 rows respectively; measurements starts empty, and I want to populate it with ~ 5,000,000 rows. To do that, I created a CSV file with all of those rows.

Example line from the CSV file

\N,143,14.61944,50.08065,1595053203,-13,\N

Then I ran

LOAD DATA LOCAL INFILE 'filename.csv' INTO TABLE measurements FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

That's taking too long. What's really interesting is that the insertion speed drops dramatically as the table gets larger. I measured how long the LOAD DATA takes as a function of the number of rows in the CSV file.

#rows   total execution time (s)
1k      0.08
2k      0.17
5k      0.39
10k     0.68
20k     0.84
50k     1.24
100k    2.00
200k    4.14
500k    22.14
1.0M    55.00
2.0M    129.71
2.1M    130.75
2.2M    163.15
2.3M    186.63
2.4M    206.74
2.5M    247.74
2.6M    264.47
2.7M    321.06
2.8M    374.64
2.9M    456.19
3.0M    524.02
3.1M    664.80
3.2M    746.19
3.3M    918.73

graph of the data directly above

Note that I always ran TRUNCATE measurements; to empty the table before each of these runs, and each run used a new random permutation of the CSV file.
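For reference, a minimal sketch of one such timed run; measuring the elapsed time inside SQL with NOW(6) is just one possible approach (any external timer works equally well):

TRUNCATE TABLE measurements;
SET @t0 = NOW(6);  -- microsecond-precision start time
LOAD DATA LOCAL INFILE 'filename.csv' INTO TABLE measurements FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
SELECT TIMESTAMPDIFF(MICROSECOND, @t0, NOW(6)) / 1e6 AS seconds;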

Q: Why do we have that sharp increase in the execution time?

I would understand if the time to insert one more row grew with the log of the size of the table. That would yield approximately n log(n) time for inserting n rows into a table that starts empty. The data and graph above, however, show a considerably faster increase, I think.

$ mysql --version
mysql  Ver 8.0.20-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))

I can provide any other necessary data.

Many thanks for any help!

EDIT 2020/07/27 15:19 UTC

As the majority of comments seemed to suggest, I created a new database with only one table -- the measurements table, which then looked like this:

CREATE TABLE measurements (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname_id INT,
    coorx real,
    coory real,
    timestamp BIGINT,
    delay integer,
    fetch_id integer
);

i.e. I removed the UNIQUE and FOREIGN KEY constraints.

I ran the same experiment again and got much higher insertion speeds, especially once the table was already large.

#rows   total execution time, without constraints (s)
1k      0.05
2k      0.10
5k      0.22
10k     0.52
20k     0.58
50k     0.84
100k    1.20
200k    2.04
500k    4.38
1.0M    9.91
2.0M    34.40
2.1M    21.48
2.2M    39.52
2.3M    42.54
2.4M    22.72
2.5M    46.58
2.6M    48.33
2.7M    31.19
2.8M    52.15
2.9M    56.70
3.0M    58.10
3.1M    60.29

graph of the data above (execution time for table with removed constraints)

This way, I'm able to add my 5 million rows into the table in ~ 40 s. Then I can add the UNIQUE and FOREIGN KEY constraints back. That took ~ 40 s for the UNIQUE constraint and ~ 160 s and ~ 110 s for the two FOREIGN KEYs. The overall time to create and populate the table as I wanted it is therefore a bit under 6 min -- much more acceptable than the time needed before.
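The constraint-adding statements look roughly like this (a sketch assuming rnames and fetch_times exist in the same database, matching the original schema):

ALTER TABLE measurements ADD UNIQUE (rname_id, coorx, coory, timestamp);
ALTER TABLE measurements ADD FOREIGN KEY (rname_id) REFERENCES rnames(id);
ALTER TABLE measurements ADD FOREIGN KEY (fetch_id) REFERENCES fetch_times(id);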

Thanks for the suggestion!

The problem itself is solved, and I get that inserting into a table is faster when there are fewer constraints. However, I'm still not sure why the insertion time starts growing as rapidly as the first experiment showed, since the time to insert a row should only grow with the log of the table size.

EDIT 2020/07/28 20:02 UTC

reacting to Rick James' answer

I have 16 GB of RAM on the laptop on which I now run the database.

innodb_buffer_pool_size was 134217728 (128 MB, i.e. the MySQL default) until now; I tried increasing it -- see below.

The full measurements table, with all ~ 5M rows and the constraints I wanted, reports a Data_length of 301760512 (~ 288 MB).

I set innodb_buffer_pool_size to 2 GB. I tried inserting 2.5M and 3.0M rows into the table with the constraints before and after this change.

#rows   buffer pool size   execution time (s)
2.5M    128 MB (before)    136.06
2.5M    2 GB (after)       132.64
3.0M    128 MB (before)    458.69
3.0M    2 GB (after)       206.99

While there is almost no change for only 2.5M rows, the reduction of execution time for 3.0M rows is pretty large. From that I guess that increasing innodb_buffer_pool_size makes the execution time grow much less dramatically with the number of rows.
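In case it helps anyone, the setting can be changed roughly like this on MySQL 8.0 (the buffer pool is resizable at runtime; putting innodb_buffer_pool_size = 2G under [mysqld] in the config file makes it permanent):

SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;  -- 2 GB
SELECT @@innodb_buffer_pool_size;  -- verify the new value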

I might try dividing the rows to load into smaller chunks in the future, but this seems good enough for now.

Upvotes: 2

Views: 2722

Answers (1)

Rick James

Reputation: 142518

Explanation

How much RAM do you have?
What is the value of innodb_buffer_pool_size?
How big (in gigabytes) is the table (SHOW TABLE STATUS)? (Queries for checking these are sketched below.)
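Two quick ways to check (a sketch using standard MySQL commands):

SELECT @@innodb_buffer_pool_size;       -- bytes InnoDB may cache in RAM
SHOW TABLE STATUS LIKE 'measurements';  -- Data_length + Index_length ~ on-disk size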

The fact that the curve is not linear makes me suspect that the load is becoming I/O-bound.

Another thing to try is to load it in chunks of, say, 10K rows at a time. The reason for this is that a lot of effort goes into preserving the previous state of the table in order to recover in case of a rollback or crash.
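A rough sketch of what chunked loading could look like, assuming the CSV has already been split into smaller files (the part_*.csv names are only placeholders):

-- with autocommit, each statement commits on its own, so the rollback/recovery work per statement stays small
LOAD DATA LOCAL INFILE 'part_001.csv' INTO TABLE measurements FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'part_002.csv' INTO TABLE measurements FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- ... and so on for the remaining chunks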

Speed ups

Get rid of fetch_times. INT is 4 bytes; TIMESTAMP is 5 bytes. There is virtually no space saved by normalizing the timestamp, and there is a lot of overhead in maintaining that extra table.

In general, do not normalize "continuous" values -- such as datetime, timestamp, float, etc. SELECTing on a normalized range can be terrible for performance.

UNIQUE (rname_id, coorx, coory, timestamp), -- It is often unwise to have floats and times in a UNIQUE index; you risk a uniqueness error because two things happen to occur in the same second, etc.

Assuming you are I/O-bound, shrinking the table will help:

REAL, I think, means DOUBLE, which takes 8 bytes. FLOAT takes only 4 bytes (8 bytes for the lat/lng pair). Sailing off the coast of Yemen? FLOAT gives a resolution of 1.7 m / 5.6 feet. DECIMAL(8,6)/(9,6) gives 16 cm / 6 inch resolution in only 1 byte more (9 bytes for the pair). DECIMAL(8,5)/(9,5) matches your input but also takes 9 bytes. http://mysql.rjweb.org/doc.php/latlng#representation_choices

INTEGER is the same as INT; both take 4 bytes. Consider MEDIUMINT -- 3 bytes and a range of about +/- 8 million. See also MEDIUMINT UNSIGNED, SMALLINT, etc.
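A sketch of what the narrower columns might look like (illustrative only; ideally these types would go straight into the CREATE TABLE, and the exact choices depend on your value ranges):

ALTER TABLE measurements
    MODIFY coorx FLOAT,      -- 4 bytes instead of 8 for REAL/DOUBLE
    MODIFY coory FLOAT,
    MODIFY delay MEDIUMINT;  -- 3 bytes, if delays stay within about +/- 8 million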

Unnormalize

In order to get rid of fetch_id, you would need to load the CSV into a temporary table, then do a second step to copy the rows into the 'real' table. The load could be very fast because it would not need secondary indexes or FKs. Chunking, in this step, would probably not be necessary.

The second step is an INSERT ... SELECT with a JOIN to fetch_times to move the data into the main table. Chunking might be necessary here, and it is easy to do if the temp table has an AUTO_INCREMENT PK.
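A sketch of that two-step load; csv_staging and fetch_timestamp are only illustrative names, with fetch_timestamp standing in for the unnormalized replacement of fetch_id:

-- step 1: raw load into a staging table with no secondary indexes or FKs
CREATE TABLE csv_staging (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname_id INT,
    coorx FLOAT,
    coory FLOAT,
    timestamp BIGINT,
    delay INT,
    fetch_id INT
);
LOAD DATA LOCAL INFILE 'filename.csv' INTO TABLE csv_staging FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- step 2: copy one chunk into the main table, pulling in the fetch timestamp itself
INSERT INTO measurements (rname_id, coorx, coory, timestamp, delay, fetch_timestamp)
SELECT s.rname_id, s.coorx, s.coory, s.timestamp, s.delay, f.timestamp
FROM csv_staging AS s
LEFT JOIN fetch_times AS f ON f.id = s.fetch_id  -- LEFT JOIN keeps rows whose fetch_id is NULL
WHERE s.id BETWEEN 1 AND 10000;  -- next chunk: 10001..20000, and so on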

Upvotes: 2
