Wookai

Reputation: 21793

How to export a large table (100M+ rows) to a text file?

I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformations, such as joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.

So far, I tried two things:

  1. Using Python, I browse the table in chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file (a simplified sketch of this loop follows the list). The trick helps, but the LIMIT queries become slower and slower as the export progresses. I have not been able to export the full table with this approach.

  2. Using the mysql command-line tool, I tried to output the result of my query directly to a text file in CSV form. Because of the size of the result, it ran out of memory and crashed.
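
Simplified (leaving out the subquery trick and the actual transformation), my loop from point 1 looks roughly like this; the MySQLdb driver, table name, credentials and tab-separated output are just placeholders:

    # Simplified sketch of the chunked LIMIT/OFFSET export from point 1.
    # Table name, credentials and output format are placeholders;
    # NULL handling and escaping are ignored here.
    import MySQLdb

    CHUNK = 10000

    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
    cur = conn.cursor()

    with open("export.txt", "w") as out:
        offset = 0
        while True:
            cur.execute("SELECT * FROM big_table LIMIT %s OFFSET %s", (CHUNK, offset))
            rows = cur.fetchall()
            if not rows:
                break
            for row in rows:
                # per-row transformation/cleaning would go here
                out.write("\t".join(str(col) for col in row) + "\n")
            offset += CHUNK
    conn.close()

As the offset grows, each chunk presumably gets slower because the server has to read past all the skipped rows.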

I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?

Upvotes: 0

Views: 3833

Answers (2)

glglgl

Reputation: 91159

Memory issues point towards using the wrong database query mechanism.

Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.

But it is not suitable for large amounts of data, as the data is cached in the client process, which can consume a lot of memory.

In this case, it may be better to use mysql_use_result() (C) or SSCursor / SSDictCursor (Python), respectively. This requires you to consume the whole result set and do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
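
A minimal sketch of the SSCursor variant in Python (table name, credentials and the tab-separated output are just placeholders):

    # Streaming export with a server-side cursor (mysql_use_result under the hood).
    # Rows are fetched from the server as you iterate, so the client never holds
    # the whole result set in memory.
    import MySQLdb
    import MySQLdb.cursors

    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                           db="mydb", cursorclass=MySQLdb.cursors.SSCursor)
    cur = conn.cursor()
    cur.execute("SELECT * FROM big_table")

    with open("export.txt", "w") as out:
        for row in cur:
            out.write("\t".join(str(col) for col in row) + "\n")

    cur.close()   # finish the result set before reusing this connection
    conn.close()

On the command line, the equivalent would be something like mysql --quick -e "SELECT ..." mydb > export.tsv, which streams tab-separated rows instead of buffering the whole result in the client.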

Upvotes: 2

Cjxcz Odjcayrwl

Reputation: 22877

I don't know exactly what query you used because you have not given it here, but I suppose you are specifying a LIMIT and OFFSET. Such queries are quite quick at the beginning of the data, but become very slow as the offset grows, because the server has to read and discard all the skipped rows.

If you have a unique column such as an ID, you can still fetch just the first N rows each time, but add a condition to the query:

WHERE ID > (last_id)

This would use the index and would be acceptably fast.
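
In Python, such a loop could look like this (assuming an indexed positive integer ID as the first selected column; the table name and connection details are made up):

    # Keyset pagination: remember the last ID seen and always ask for "id > last_id".
    # Assumes an indexed positive integer primary key `id` as the first column.
    import MySQLdb

    CHUNK = 10000

    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
    cur = conn.cursor()

    with open("export.txt", "w") as out:
        last_id = 0
        while True:
            cur.execute("SELECT * FROM big_table WHERE id > %s ORDER BY id LIMIT %s",
                        (last_id, CHUNK))
            rows = cur.fetchall()
            if not rows:
                break
            for row in rows:
                out.write("\t".join(str(col) for col in row) + "\n")
            last_id = rows[-1][0]   # id is assumed to be the first column
    conn.close()

Each iteration seeks straight to the last ID in the index, so the cost per chunk stays roughly constant no matter how far the export has progressed.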

However, it should generally be faster to simply do

SELECT * FROM table

and open a cursor for that query, with a reasonably big fetch size.
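
For the fetch size to actually limit memory in Python, you would combine this with the server-side cursor from the other answer, roughly like this (placeholders again):

    # One full-table query, consumed in batches via fetchmany().
    import MySQLdb
    import MySQLdb.cursors

    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                           db="mydb", cursorclass=MySQLdb.cursors.SSCursor)
    cur = conn.cursor()
    cur.execute("SELECT * FROM big_table")

    with open("export.txt", "w") as out:
        while True:
            rows = cur.fetchmany(10000)   # the "reasonably big fetch size"
            if not rows:
                break
            for row in rows:
                out.write("\t".join(str(col) for col in row) + "\n")
    conn.close()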

Upvotes: 1
