I'm running MySQL on my local computer. I've set up a database with multiple tables, 3 of which are potentially interesting here:
CREATE TABLE rnames (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname VARCHAR(16) UNIQUE
);
CREATE TABLE fetch_times (
    id INT AUTO_INCREMENT PRIMARY KEY,
    timestamp BIGINT NOT NULL
);
CREATE TABLE measurements (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname_id INT,
    coorx REAL,
    coory REAL,
    timestamp BIGINT,
    delay INTEGER,
    fetch_id INTEGER,
    UNIQUE (rname_id, coorx, coory, timestamp),
    FOREIGN KEY (rname_id) REFERENCES rnames(id),
    FOREIGN KEY (fetch_id) REFERENCES fetch_times(id)
);
rnames and fetch_times have 397 and 2 rows respectively; measurements starts as an empty table and I want to populate it with ~ 5,000,000 rows. To do that, I created a CSV file with all those rows.
Example line from the CSV file:
\N,143,14.61944,50.08065,1595053203,-13,\N
Then I ran:
LOAD DATA LOCAL INFILE 'filename.csv' INTO TABLE measurements FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
That's taking too long. What's really interesting is that the insertion speed dramatically drops as the table gets larger. I've made some measurements of how long the LOAD DATA takes given the number of rows in the CSV file:
#rows total execution time (s)
1k 0.08
2k 0.17
5k 0.39
10k 0.68
20k 0.84
50k 1.24
100k 2.00
200k 4.14
500k 22.14
1.0M 55.00
2.0M 129.71
2.1M 130.75
2.2M 163.15
2.3M 186.63
2.4M 206.74
2.5M 247.74
2.6M 264.47
2.7M 321.06
2.8M 374.64
2.9M 456.19
3.0M 524.02
3.1M 664.80
3.2M 746.19
3.3M 918.73
Note that I always ran TRUNCATE measurements; to empty the table before each of those measurements, and each measurement used a new random permutation of the CSV file.
Q: Why do we have that sharp increase in the execution time?
I would understand that the time to insert one more row grows with the log of the table size. That would yield approximately n log(n) time for inserting n rows into a table that starts empty. The above data and the graph I plotted from it, however, show a considerably faster increase, I think.
$ mysql --version
mysql Ver 8.0.20-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))
I can provide any other necessary data.
Many thanks for any help!
EDIT 2020/07/27 15:19 UTC
As the majority of comments seemed to suggest, I created a new database with only one table -- the measurements table, which then looked like this:
CREATE TABLE measurements (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname_id INT,
    coorx REAL,
    coory REAL,
    timestamp BIGINT,
    delay INTEGER,
    fetch_id INTEGER
);
i.e. I removed the UNIQUE and FOREIGN KEY constraints.
I ran the same experiment again and got much higher insertion speeds, especially once the table had already grown large:
#rows total execution time, without constraints (s)
1k 0.05
2k 0.10
5k 0.22
10k 0.52
20k 0.58
50k 0.84
100k 1.20
200k 2.04
500k 4.38
1.0M 9.91
2.0M 34.40
2.1M 21.48
2.2M 39.52
2.3M 42.54
2.4M 22.72
2.5M 46.58
2.6M 48.33
2.7M 31.19
2.8M 52.15
2.9M 56.70
3.0M 58.10
3.1M 60.29
This way, I'm able to add my 5 million rows to the table in ~ 40 s. Then I can add the UNIQUE and FOREIGN KEY constraints. That took ~ 40 s for the UNIQUE constraint and ~ 160 s and ~ 110 s for the two FOREIGN KEYs. The overall time to create and populate the table as I wanted it is therefore a bit under 6 minutes -- much more acceptable than the time needed before.
Thanks for the suggestion!
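For reference, re-adding the constraints after the bulk load looks roughly like this (a sketch mirroring the original schema; I show three separate statements since I timed the UNIQUE and the two FOREIGN KEYs separately):
ALTER TABLE measurements ADD UNIQUE (rname_id, coorx, coory, timestamp);
ALTER TABLE measurements ADD FOREIGN KEY (rname_id) REFERENCES rnames(id);
ALTER TABLE measurements ADD FOREIGN KEY (fetch_id) REFERENCES fetch_times(id);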
The problem itself is solved and I get that inserting into a table is faster when there are fewer constraints. However, I'm still not sure why the insertion time starts growing as rapidly as the first experiment showed, since the time to insert a row should only grow with the log of the table size.
EDIT 2020/07/28 20:02 UTC
Reacting to Rick James' answer:
I have 16 GB of RAM on the laptop where I'm now running the database.
innodb_buffer_pool_size was 134217728 (128 MB, i.e. the MySQL default) until now; I tried increasing that -- see below.
The full measurements table, with all the ~ 5M rows and the constraints I wanted it to have, reports 301760512 (~ 288 MB) for Data_length.
I set innodb_buffer_pool_size to 2 GB. I tried inserting 2.5M and 3.0M rows into the table with the constraints before and after this change:
2.5M rows before 136.06 s
2.5M rows after 132.64 s
3.0M rows before 458.69 s
3.0M rows after 206.99 s
While there is almost no change for 2.5M rows, the reduction in execution time for 3.0M rows is pretty large. From that I guess that increasing innodb_buffer_pool_size makes the execution time grow much less dramatically with the number of rows.
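For reference, the setting can be checked and, since MySQL 5.7.5, resized at runtime without a restart; a minimal sketch:
-- current value, in bytes
SELECT @@innodb_buffer_pool_size;
-- resize online; InnoDB rounds the value to a multiple of
-- innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances
SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;
-- or, in MySQL 8.0, make the change survive restarts
SET PERSIST innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;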
I might try dividing the rows to load into smaller chunks in the future, but this seems good enough for now.
Explanation
How much RAM do you have? What is the value of innodb_buffer_pool_size? How big (in gigabytes) is the table (see SHOW TABLE STATUS)?
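A quick way to pull those numbers ('measurements' is the table name from the question):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';   -- buffer pool size in bytes
SHOW TABLE STATUS LIKE 'measurements';           -- Data_length / Index_length in bytes
-- or data + index size in MB via information_schema
SELECT ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
    FROM information_schema.tables
    WHERE table_schema = DATABASE() AND table_name = 'measurements';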
The fact that the curve is not linear makes me suspect that it is becoming I/O-bound.
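One rough way to check that while the load is running (these are standard InnoDB status counters): if Innodb_buffer_pool_reads (reads that had to go to disk) climbs quickly relative to Innodb_buffer_pool_read_requests (logical reads), the workload has become I/O-bound.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';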
Another thing to try is to load it in chunks of, say, 10K rows at a time. The reason for this is that a lot of effort goes into preserving the previous state of the table in order to recover in case of a rollback or crash.
Speed ups
Get rid of fetch_times. INT is 4 bytes; TIMESTAMP is 5 bytes. There is virtually no space saved by normalizing the timestamp, and there is a lot of overhead in maintaining that extra table.
In general, do not normalize "continuous" values -- such as datetime, timestamp, float, etc. SELECTing on a normalized range can be terrible for performance.
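A small sketch of the difference (fetch_timestamp is a hypothetical column holding the fetch time directly in measurements):
-- fetch time stored inline: a plain range condition
SELECT COUNT(*) FROM measurements
    WHERE fetch_timestamp BETWEEN 1595000000 AND 1595100000;
-- fetch time normalized away: the same filter needs a join through fetch_times
SELECT COUNT(*) FROM measurements AS m
    JOIN fetch_times AS f ON f.id = m.fetch_id
    WHERE f.timestamp BETWEEN 1595000000 AND 1595100000;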
UNIQUE (rname_id, coorx, coory, timestamp) -- It is often unwise to have floats and times in a UNIQUE index; you risk a uniqueness error because two things happen to occur in the same second, etc.
Assuming you are I/O-bound, shrinking the table will help:
REAL, I think, means DOUBLE, which takes 8 bytes. FLOAT takes only 4 bytes (8 bytes for the lat/lng pair). Sailing off the coast of Yemen? FLOAT gives a resolution of 1.7 m / 5.6 feet. DECIMAL(8,6)/(9,6) gives 16 cm / 6 inch resolution in only 1 byte more (9 bytes for the pair). DECIMAL(8,5)/(9,5) matches your input but also takes 9 bytes. http://mysql.rjweb.org/doc.php/latlng#representation_choices
INTEGER is the same as INT; they take 4 bytes. Consider MEDIUMINT -- 3 bytes and a range of +/- 8M. See also MEDIUMINT UNSIGNED, SMALLINT, etc.
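Putting those size suggestions together, a slimmer table might look roughly like this -- a sketch, not a drop-in replacement: the exact types are judgment calls, MEDIUMINT for delay assumes its values stay within about +/- 8 million, and fetch_timestamp is the denormalized replacement for fetch_id discussed below:
CREATE TABLE measurements (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rname_id SMALLINT UNSIGNED,     -- 397 distinct names fit easily in 2 bytes; requires
                                    -- rnames.id to be SMALLINT UNSIGNED too (FK integer
                                    -- columns must match in size and sign)
    coorx DECIMAL(9,6),             -- longitude, ~16 cm resolution, 5 bytes
    coory DECIMAL(8,6),             -- latitude, 4 bytes
    timestamp INT UNSIGNED,         -- epoch seconds fit in 4 bytes until 2106
    delay MEDIUMINT,                -- 3 bytes, range about +/- 8M
    fetch_timestamp INT UNSIGNED,   -- fetch time stored inline; fetch_times dropped
    UNIQUE (rname_id, coorx, coory, timestamp),
    FOREIGN KEY (rname_id) REFERENCES rnames(id)
);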
Unnormalize
In order to get rid of fetch_id, you would need to load the CSV into a temporary table, then do a second step to copy the rows into the 'real' table. The load could be very fast because it would not need secondary indexes or FKs. Chunking, in this step, would probably not be necessary.
The second step is to do an INSERT ... SELECT with a JOIN to fetch_times to move the data into the main table. Chunking might be necessary, and it is easy to do if the temp table has an AUTO_INCREMENT PK.