xoail
xoail

Reputation: 3064

MySQL Large Datasets

I have large sets of data. Over 40GB that I loaded in MySQL table. I am trying to perform simple queries like select * from tablename but it takes gazillion minutes to run and eventually times out. If I set a limit, the execution is fairly fast ex: select * from tablename limit 1000. The table has over 200 million records.

Any tips on working with these types of datasets?

Upvotes: 4

Views: 8145

Answers (2)

bman
bman

Reputation: 5235

For large databases, one should consider using an alternative solutions such as Apache Spark. MySQL reads the data from disk which is a slow operation. Nothing can work as fast as a technology that is based on MapReduce. Take a look to this answer. It is true that with large databases, queries get very challenging.

Anyway assuming you want to stick with MySQL, first of all if you are using MyISAM, make sure to convert your database storage to InnoDB. This is especially important if you have lots of read/write operations.

It is also important to partition, that reduce the table into more manageable smaller tables. It will also enhance the indexes performance.

Do not be too generous with adding indexes. Define indexes wisely. If an index does not need to be UNIQUE do not define it as one. If an index does not need to include multiple fields do not include multiple fields.

Most importantly start monitor your MySQL instance. Use SHOW ENGINE INNODB STATUS to investigate the performance of your MySQL instance.

Upvotes: 2

N.B.
N.B.

Reputation: 14071

First thing you need to do is completely ignore all answers and comments advising some other, awesome, mumbo jumbo technology. It's absolute bullshit. Those things can't work in a different way because they're all constrained with the same problem - hardware.

Now, let's get back to MySQL. The problem with LIMIT is that MySQL takes the whole data set, then takes LIMIT amount of rows starting from OFFSET. That means if you do SELECT * FROM my_table LIMIT 1000 - it will take all 200 million rows, buffer them, then it will start counting from 0 to 999 and discard the rest.

Yes, it takes time. Yes, it appears as dumb. However, MySQL doesn't know what "start" or "end" mean, so it can't know what limit and offset are until you tell it so.

To improve your search, you can use something like this (assuming you have numeric primary key):

SELECT * FROM tablename WHERE id < 10000 LIMIT 1000;

In this case, instead of with 200 million rows, MySQL will work with all rows whose PK is below 10 000. Much easier, much quicker, also readable. Numbers can be tweaked at any point and if you perform a pagination of some sort in a scripting language, you can always transfer the last numeric id that was present so MySQL can start from that id onwards in its search.

Also, you should be using InnoDB engine, and tweak it using innodb_buffer_pool_size which is the magic sauce that makes MySQL fly.

Upvotes: 11

Related Questions