Reputation: 21877
I have a very large CSV file, ~800,000 lines. I would like to process this file in parallel to speed up my script.
How does one use Ruby to break a file into n number of smaller pieces?
Upvotes: 1
Views: 3522
Reputation: 438
For CSV files, you can do this:
File.foreach("your_file.csv") do |line|
  # split the line into fields
  fields = line.split(",")
  # or store the line in an array
  some_array << line
  # or write it back to another file
  some_file_handler << line
end
(File.foreach closes the file when it is done; a bare open leaves the handle dangling.)
By storing lines (or their split fields) in arrays (memory) or files, you can break a large file into smaller pieces. After that, threads can be used to process each piece:
threads = []
chunks.each { |chunk| threads << Thread.new { process(chunk) } }  # process is your own method
threads.each(&:join)
Note that you are responsible for thread safety.
Hope this helps!
Update:
As pguardiario advised, we can use the CSV class from the standard library instead of opening the file directly.
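A minimal sketch of that suggestion, using the standard library's CSV class together with each_slice to form the chunks (the sample data, chunk size of 4, and row-counting work are all arbitrary placeholders):

```ruby
require "csv"
require "tempfile"

# Build a small sample CSV so the sketch is self-contained
# (the file contents and chunk size are assumptions).
sample = Tempfile.new(["data", ".csv"])
sample.write("id,name\n")
1.upto(10) { |i| sample.write("#{i},row#{i}\n") }
sample.close

# CSV.foreach streams rows without loading the whole file;
# each_slice groups them into chunks that threads can work on.
chunks = CSV.foreach(sample.path, headers: true).each_slice(4).to_a

# One thread per chunk; here each thread simply counts its rows.
counts = chunks.map { |chunk| Thread.new { chunk.size } }.map(&:value)
```

With 10 data rows and a slice size of 4 this yields three chunks of 4, 4, and 2 rows; swap the row count for your real per-chunk work.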
Upvotes: 1
Reputation: 33732
Breaking up the CSV file into chunks is the way to go, but keep in mind that each chunk needs to keep the first line with the CSV header!
So UNIX 'split' will not cut it!
You'll have to write your own little Ruby script that reads the header line and stores it in a variable, then writes the next N lines to a new partial CSV file, copying the header line into each file first, and so on.
After creating each file with the header and a chunk of lines, you could use Resque to enqueue those files for parallel processing by Resque workers.
http://railscasts.com/episodes/271-resque
Upvotes: 2
Reputation: 688
I would use the Linux split command to break this file into many smaller files, then process those smaller parts.
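For example (big.csv and the chunk_ prefix are just example names; -l sets the number of lines per piece):

```shell
# Create a 10-line sample file, then split it into 3-line pieces.
seq 1 10 > big.csv
split -l 3 big.csv chunk_
ls chunk_*   # chunk_aa chunk_ab chunk_ac chunk_ad
```

Note that, as pointed out in another answer, plain split will not repeat a CSV header line into each piece.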
Upvotes: 0