Reputation: 163
About 2 months ago, I imported EnWikipedia data(http://dumps.wikimedia.org/enwiki/20120211/) into mysql.
After finished importing EnWikipedia data, I have been creating index in the tables of the EnWikipedia database in mysql for about 2 month.
Now, I have reached the point of creating index in "pagelinks".
However, it seems to take an infinite time to pass that point.
Therefore, I checked the time remaining to pass to ensure that my intuition was correct or not.
As a result, the expected time remaining was 60 days(assuming that I create index in "pagelinks" again from the beginning.)
My EnWikipedia database has 7 tables:
"categorylinks"(records: 60 mil, size: 23.5 GiB),
"langlinks"(records: 15 mil, size: 1.5 GiB),
"page"(records: 26 mil, size 4.9 GiB),
"pagelinks"(records: 630 mil, size: 56.4 GiB),
"redirect"(records: 6 mil, size: 327.8 MiB),
"revision"(records: 26 mil, size: 4.6 GiB) and "text"(records: 26 mil, size: 60.8 GiB).
My server is... Linux version 2.6.32-5-amd64 (Debian 2.6.32-39),Memory 16GB, 2.39Ghz Intel 4 core
Is that common phenomenon for creating index to take so long days ? Does anyone have a good solution to create index more quickly ?
Thanks in advance !
P.S: I made following operations for checking the time remaining.
References(Sorry,following page is written in Japanese): http://d.hatena.ne.jp/sh2/20110615
1st. I got records in "pagelink".
mysql> select count(*) from pagelinks;
+-----------+
| count(*) |
+-----------+
| 632047759 |
+-----------+
1 row in set (1 hour 25 min 26.18 sec)
2nd. I got the amount of records increased per minute.
getHandler_write.sh
#!/bin/bash
while true
do
cat <<_EOF_
SHOW GLOBAL STATUS LIKE 'Handler_write';
_EOF_
sleep 60
done | mysql -u root -p -N
command
$ sh getHandler_write.sh
Enter password:
Handler_write 1289808074
Handler_write 1289814597
Handler_write 1289822748
Handler_write 1289829789
Handler_write 1289836322
Handler_write 1289844916
Handler_write 1289852226
3rd. I computed the speed of recording.
According to the result of 2. ,the speed of recording is
7233 records/minutes
4th. Then the time remaining is
(632047759/7233)/60/24 = 60 days
Upvotes: 16
Views: 30298
Reputation: 4257
Those are pretty big tables, so I'd expect the indexing to be pretty slow. 630 million records is a LOT of data to index. One thing to look at is partitioning, with data sets that large, without correctly partitioned tables, performance will be sloooow. Here's some useful links: using partioning on slow indexes you could also try looking at the buffer size settings for building the indexes (the default is 8MB, do for your large table that's going to slow you down a fair bit. buffer size documentation
Upvotes: 9