Reputation: 1821
How can I persist a large amount of data read from a CSV file (say 20 million rows)? The import has been running for close to 1.5 days so far and has persisted only 10 million rows. How can I batch this so that it becomes faster, and is there a possibility to run it in parallel?
I am using the code here to read the CSV; I would like to know if there is a better way to achieve this.
Refer: dealing with large CSV files (20G) in ruby
Upvotes: 3
Views: 984
Reputation: 11
Key points:
Sorry for my poor English.
Upvotes: 1
Reputation: 318
Since you would like to persist your data to MySQL for further processing, using LOAD DATA INFILE from MySQL would be faster. Something like the following, adapted to your schema:
sql = "LOAD DATA LOCAL INFILE 'big_data.csv' INTO TABLE tests FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' (foo,foo1)"
con = ActiveRecord::Base.connection
con.execute(sql)
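One caveat worth noting: LOAD DATA LOCAL INFILE only works if the client connection has local_infile enabled (and the server allows it). A minimal sketch, assuming the mysql2 adapter passes this flag through; the database name and credentials are placeholders, and in a Rails app you would normally set this in database.yml instead:
# Enable local_infile on the connection so LOAD DATA LOCAL INFILE is permitted.
ActiveRecord::Base.establish_connection(
  adapter:      'mysql2',
  database:     'my_app',   # placeholder
  username:     'root',     # placeholder
  local_infile: true
)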
Upvotes: 2
Reputation: 699
You can first split the file into several smaller files; then you will be able to process those files in parallel.
For splitting the file it will probably be faster to use a tool like split:
split -l 1000000 ./test.txt ./out-files-
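To actually run the chunks in parallel from Ruby, one option is to fork a worker per chunk file. A rough sketch, assuming the files produced by split sit in the current directory; import_chunk is a hypothetical stand-in for whatever per-file import logic you already have:
# Rough sketch: fork one worker per chunk produced by `split`, a few at a time.
# `import_chunk` is a hypothetical stand-in for your per-file import logic.
chunk_files = Dir.glob('./out-files-*').sort

chunk_files.each_slice(4) do |batch|            # at most 4 workers at once
  pids = batch.map do |path|
    Process.fork do
      # each forked child needs its own database connection
      ActiveRecord::Base.establish_connection
      import_chunk(path)
    end
  end
  pids.each { |pid| Process.wait(pid) }         # finish this batch before starting the next
end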
Then, while processing each of the files, instead of inserting the records one by one you can combine them into batches and do bulk inserts. Something like:
INSERT INTO some_table
VALUES
(1,'data1'),
(2, 'data2')
For better performance you'll need to build the SQL statement yourself and execute it:
ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
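Putting the pieces together, here is a rough sketch of such a batched insert, using the tests table and (foo, foo1) columns from the earlier answer as an example schema; the chunk filename and the batch size of 1000 are just assumptions:
require 'csv'

conn = ActiveRecord::Base.connection
batch_size = 1_000                                   # assumed batch size

CSV.foreach('./out-files-aa').each_slice(batch_size) do |rows|
  values = rows.map do |(foo, foo1)|
    "(#{conn.quote(foo)}, #{conn.quote(foo1)})"      # quote each value before interpolating
  end
  conn.execute("INSERT INTO tests (foo, foo1) VALUES #{values.join(', ')}")
end
Quoting each value through the connection keeps the hand-built SQL safe against stray quotes or commas inside the CSV fields.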
Upvotes: 4