JZ.

Reputation: 21877

Processing a CSV file in parallel using Ruby

I have a very large CSV file, ~ 800,000 lines. I would like to attempt to process this file in parallel to speed up my script.

How does one use Ruby to break a file into n smaller pieces?

Upvotes: 1

Views: 3522

Answers (3)

Tomato

Reputation: 438

For CSV files, you can do this:

open("your_file.csv").each_line do |line|
  # do your stuff here like split lines
  line.split(",")

  # or store them in an array
  some_array << line

  # or write them back to a file
  some_file_handler << line
end

By storing lines (or split fields) in an array (in memory) or in files, you can break a large file into smaller pieces. After that, threads can be used to process each piece:

threads = []
# pieces and process are placeholders for your chunks and your per-chunk work
1.upto(5) { |i| threads << Thread.new { process(pieces[i]) } }

threads.each(&:join)

Note that you are responsible for thread safety.

Hope this helps!

Update:

Following pguardiario's advice, we can use the CSV class from the standard library instead of opening the file directly.
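
A minimal sketch of that approach (CSV.foreach streams one parsed row at a time, so the whole file never sits in memory at once):

require "csv"

some_array = []

# Each row arrives already parsed into an array of field values
CSV.foreach("your_file.csv") do |row|
  some_array << row
end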

Upvotes: 1

Tilo

Reputation: 33732

Breaking up the CSV file into chunks is in order, but you have to keep in mind that each chunk needs to keep the first line with the CSV header!

So UNIX 'split' will not cut it!

You'll have to write your own little Ruby script which reads the first line and stores it in a variable, then distributes the next N lines to a new partial CSV file, copying the CSV header line into it first, and so on.
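
A minimal sketch of such a script, assuming a chunk size of 10,000 lines and hypothetical output names like your_file_part_0.csv:

CHUNK_SIZE = 10_000  # lines per piece; tune to your workload

File.open("your_file.csv") do |input|
  header = input.readline                    # read and keep the header line
  input.each_slice(CHUNK_SIZE).with_index do |lines, i|
    File.open("your_file_part_#{i}.csv", "w") do |out|
      out.write(header)                      # every piece starts with the CSV header
      lines.each { |line| out.write(line) }
    end
  end
end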

After creating each file with the header and a chunk of lines, you could then use Resque to enqueue those files for parallel processing by Resque workers.

http://railscasts.com/episodes/271-resque
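
For illustration, a hedged sketch of the enqueueing side, assuming a hypothetical CsvChunkJob worker and the your_file_part_*.csv pieces from the sketch above:

require "csv"
require "resque"

# Hypothetical worker class; Resque runs self.perform in a background worker process
class CsvChunkJob
  @queue = :csv_chunks

  def self.perform(path)
    CSV.foreach(path, headers: true) do |row|
      # process one row of this chunk here
    end
  end
end

# Enqueue one job per partial file
Dir.glob("your_file_part_*.csv") do |path|
  Resque.enqueue(CsvChunkJob, path)
end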

Upvotes: 2

tartar

Reputation: 688

I would use the Linux split command to break this file into many smaller files, then process those smaller parts, as sketched below.
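
A minimal sketch, shelling out from Ruby and assuming the coreutils split command is on the PATH (note Tilo's caveat above: split does not copy the CSV header into each piece):

# 100,000 lines per piece, written to piece_aa, piece_ab, ...
system("split", "-l", "100000", "your_file.csv", "piece_")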

Upvotes: 0
