Reputation: 3064
I have large sets of data. Over 40GB that I loaded in MySQL table. I am trying to perform simple queries like select * from tablename
but it takes gazillion minutes to run and eventually times out. If I set a limit, the execution is fairly fast ex: select * from tablename limit 1000
.
The table has over 200 million records.
Any tips on working with these types of datasets?
Upvotes: 4
Views: 8145
Reputation: 5235
For large databases, one should consider using an alternative solutions such as Apache Spark. MySQL reads the data from disk which is a slow operation. Nothing can work as fast as a technology that is based on MapReduce. Take a look to this answer. It is true that with large databases, queries get very challenging.
Anyway assuming you want to stick with MySQL, first of all if you are using MyISAM, make sure to convert your database storage to InnoDB. This is especially important if you have lots of read/write operations.
It is also important to partition, that reduce the table into more manageable smaller tables. It will also enhance the indexes performance.
Do not be too generous with adding indexes. Define indexes wisely. If an index does not need to be UNIQUE
do not define it as one. If an index does not need to include multiple fields do not include multiple fields.
Most importantly start monitor your MySQL instance. Use SHOW ENGINE INNODB STATUS
to investigate the performance of your MySQL instance.
Upvotes: 2
Reputation: 14071
First thing you need to do is completely ignore all answers and comments advising some other, awesome, mumbo jumbo technology. It's absolute bullshit. Those things can't work in a different way because they're all constrained with the same problem - hardware.
Now, let's get back to MySQL. The problem with LIMIT
is that MySQL takes the whole data set, then takes LIMIT
amount of rows starting from OFFSET
. That means if you do SELECT * FROM my_table LIMIT 1000
- it will take all 200 million rows, buffer them, then it will start counting from 0 to 999 and discard the rest.
Yes, it takes time. Yes, it appears as dumb. However, MySQL doesn't know what "start" or "end" mean, so it can't know what limit and offset are until you tell it so.
To improve your search, you can use something like this (assuming you have numeric primary key):
SELECT * FROM tablename WHERE id < 10000 LIMIT 1000;
In this case, instead of with 200 million rows, MySQL will work with all rows whose PK is below 10 000. Much easier, much quicker, also readable. Numbers can be tweaked at any point and if you perform a pagination of some sort in a scripting language, you can always transfer the last numeric id that was present so MySQL can start from that id onwards in its search.
Also, you should be using InnoDB
engine, and tweak it using innodb_buffer_pool_size
which is the magic sauce that makes MySQL fly.
Upvotes: 11