Brisi

Reputation: 1821

How to persist large amounts of data by reading from a CSV file

How can I persist a large amount of data read from a CSV file (say 20 million rows)? My current run has been going for close to 1 1/2 days and has persisted only 10 million rows. How can I batch the inserts so the process becomes faster, and is there a way to run this in parallel?

I am using the code from the question referenced below to read the CSV; I would like to know if there is a better way to achieve this.

Refer: dealing with large CSV files (20G) in ruby

Upvotes: 3

Views: 984

Answers (3)

sulpha

Reputation: 11

Key points:

  1. If you use the MySQL InnoDB engine, my advice is to always define an auto-increment PRIMARY KEY, because InnoDB uses a clustered index to store the data in the table, and a clustered index determines the physical order of data in a table.
    refer: http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/
  2. Configure your MySQL server parameters; the most important ones are
    (1) disable the MySQL binlog
    (2) innodb_buffer_pool_size.
    (3) innodb_flush_log_at_trx_commit
    (4) bulk_insert_buffer_size
    You can read this: http://www.percona.com/blog/2013/09/20/innodb-performance-optimization-basics-updated/
  3. You should use a producer-consumer pattern (see the sketch after this list).
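
A minimal sketch of the producer-consumer pattern from point 3 (this is not the answerer's code; the tests(foo, foo1) schema is borrowed from the other answers, and the batch size, thread count, and the assumption that the CSV has a header row are illustrative):

require 'csv'
require 'thread'

BATCH_SIZE  = 1_000
NUM_WORKERS = 4
queue = SizedQueue.new(10)   # bounded, so the reader cannot outrun the writers

# Producer: read the CSV and enqueue batches of rows.
producer = Thread.new do
  batch = []
  CSV.foreach('big_data.csv', headers: true) do |row|
    batch << [row['foo'], row['foo1']]
    if batch.size >= BATCH_SIZE
      queue << batch
      batch = []
    end
  end
  queue << batch unless batch.empty?
  NUM_WORKERS.times { queue << :done }   # poison pills to stop the workers
end

# Consumers: pop batches and run one multi-row INSERT per batch.
workers = NUM_WORKERS.times.map do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do |conn|
      while (batch = queue.pop) != :done
        values = batch.map { |foo, foo1| "(#{conn.quote(foo)}, #{conn.quote(foo1)})" }.join(', ')
        conn.execute("INSERT INTO tests (foo, foo1) VALUES #{values}")
      end
    end
  end
end

producer.join
workers.each(&:join)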

Sorry for my poor English.

Upvotes: 1

Graham
Graham

Reputation: 318

Since you would like to persist your data to MySQL for further processing, using MySQL's LOAD DATA INFILE would be faster. Something like the following, adapted to your schema:

sql = "LOAD DATA LOCAL INFILE 'big_data.csv' INTO TABLE tests FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' (foo,foo1)"

con = ActiveRecord::Base.connection

con.execute(sql)
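
Note that LOAD DATA LOCAL INFILE only works if the local_infile option is enabled on both the server and the client connection. With the mysql2 adapter that ActiveRecord typically uses for MySQL, it can be passed in the connection configuration; a sketch (the connection details are placeholders):

ActiveRecord::Base.establish_connection(
  adapter:      'mysql2',
  host:         'localhost',
  database:     'mydb',        # placeholder database name
  username:     'myuser',      # placeholder credentials
  password:     'mypassword',
  local_infile: true           # allow LOAD DATA LOCAL INFILE on this connection
)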

Upvotes: 2

Kamen

Reputation: 699

You can try splitting the file into several smaller files first; then you will be able to process several files in parallel.

For splitting the file, it will probably be faster to use a tool like split:

split -l 1000000 ./test.txt ./out-files-

Then, while processing each of the files, instead of inserting the records one by one you can combine them into batches and do bulk inserts. Something like:

INSERT INTO some_table 
VALUES 
(1,'data1'), 
(2, 'data2')

For better performance you'll need to build the SQL statement yourself and execute it:

ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
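
A minimal sketch of putting the split files and the batched inserts together (the tests(foo, foo1) schema is borrowed from the LOAD DATA answer above, and the chunk file name and batch size are illustrative):

require 'csv'

BATCH_SIZE = 1_000
conn = ActiveRecord::Base.connection

# Process one chunk produced by split; each worker process can be given a
# different chunk file (out-files-aa, out-files-ab, ...) to run in parallel.
CSV.foreach('out-files-aa').each_slice(BATCH_SIZE) do |rows|
  values = rows.map { |foo, foo1| "(#{conn.quote(foo)}, #{conn.quote(foo1)})" }.join(', ')
  conn.execute("INSERT INTO tests (foo, foo1) VALUES #{values}")
end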

Upvotes: 4
