Reputation: 213
I'll try to expand on the title of my question. I work on a Ruby project where I have to process a large amount of data (around 120,000 records) stored in CSV files. I have to read this data, process it and put it into the DB. Right now it takes a couple of days, and I need to make it much faster. The problem is that sometimes I get anomalies during processing and have to repeat the whole import. I decided that improving performance is more important than looking for the bug with a small data sample. For now I'm sticking with CSV files. I decided to benchmark the processing script to find the bottlenecks and improve loading the data from CSV. I see the following steps:
For now I use the standard Ruby CSV library. Do you recommend a better gem?
If any of you are familiar with a similar problem, I would be happy to hear your opinion.
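As a rough illustration of the benchmarking step, this is how the individual phases could be timed with Ruby's standard Benchmark module; the file name and the measured blocks are placeholders for the real parsing and processing steps:
require 'benchmark'
require 'csv'

# Hypothetical file name; the blocks stand in for the real parsing/processing work.
file = 'data/import.csv'

Benchmark.bm(14) do |x|
  x.report('read rows:')    { CSV.foreach(file, headers: true) { |row| row } }
  x.report('build hashes:') { CSV.foreach(file, headers: true) { |row| row.to_h } }
end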
Edit:
Database: PostgreSQL
System: Linux
Upvotes: 3
Views: 811
Reputation: 207465
Make sure your Ruby script can process multiple files given as parameters, i.e. that you can run it like this:
script.rb abc.csv def.csv xyz.csv
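A minimal sketch of what such a script might look like (process_row here is a placeholder for your own import logic):
#!/usr/bin/env ruby
require 'csv'

# Treat every command-line argument as a CSV file to import.
ARGV.each do |path|
  CSV.foreach(path, headers: true) do |row|
    process_row(row) # your own processing / DB insert goes here
  end
end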
Then parallelise it using GNU Parallel like this to keep all your CPU cores busy:
find . -name \*.csv -print0 | parallel -0 -X script.rb
The -X passes as many CSV files as possible to your Ruby job without exceeding the maximum command-line length. You can add -j 8 after parallel if you want GNU Parallel to run, say, 8 jobs at a time, and you can use --eta to get the estimated arrival/finish time:
find . -name \*.csv -print0 | parallel -0 -X -j 8 --eta script.rb
By default, GNU Parallel will run as many jobs in parallel as you have CPU cores.
Upvotes: 2
Reputation: 42192
I haven't had the opportunity to test it myself, but recently I came across this article, which seems to do the job.
You'll have to adapt it to use CSV instead of XLSX. For future reference, in case the article goes offline, here is the code. It works by writing BATCH_IMPORT_SIZE records to the database at a time, which should give a huge performance gain.
class ExcelDataParser
  BATCH_IMPORT_SIZE = 1000

  def initialize(file_path)
    @file_path = file_path
    @records = []
    @counter = 0 # number of rows processed so far
  end

  def call
    rows.each do |row|
      increment_counter
      records << build_new_record(row)
      import_records if reached_batch_import_size? || reached_end_of_file?
    end
  end

  private

  attr_reader :file_path, :records
  attr_accessor :counter

  def book
    @book ||= Creek::Book.new(file_path)
  end

  # in this example, we assume that the
  # content is in the first Excel sheet
  def rows
    @rows ||= book.sheets.first.rows
  end

  def increment_counter
    self.counter += 1
  end

  def row_count
    @row_count ||= rows.count
  end

  def build_new_record(row)
    # only build a new record without saving it
    RecordModel.new(...)
  end

  def import_records
    # save multiple records using the activerecord-import gem
    RecordModel.import(records)
    # clear the records array
    records.clear
  end

  def reached_batch_import_size?
    (counter % BATCH_IMPORT_SIZE).zero?
  end

  def reached_end_of_file?
    counter == row_count
  end
end
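A rough sketch of the same idea adapted to CSV could look like the following; RecordModel and the assumption that the CSV headers map directly to model attributes are placeholders, while the batching with activerecord-import stays the same:
require 'csv'

class CsvDataParser
  BATCH_IMPORT_SIZE = 1000

  def initialize(file_path)
    @file_path = file_path
    @records = []
  end

  def call
    CSV.foreach(@file_path, headers: true) do |row|
      # build a record per row; assumes CSV headers match the model's attributes
      @records << RecordModel.new(row.to_h)
      flush if @records.size >= BATCH_IMPORT_SIZE
    end
    flush # import whatever is left after the last full batch
  end

  private

  def flush
    return if @records.empty?
    RecordModel.import(@records) # bulk insert via activerecord-import
    @records.clear
  end
end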
Upvotes: 3
Reputation: 1357
There are a few ways of going about it. I personally would recommend SmarterCSV, which makes it much faster and easier to process CSVs as an array of hashes. You should definitely split up the work if possible, perhaps by making a queue of files to process and handling it in batches with the help of Redis.
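A rough sketch of how that could look, assuming an ActiveRecord model (RecordModel here is a placeholder) combined with activerecord-import for the batched inserts:
require 'smarter_csv'

# Process the file in chunks; each chunk is an array of hashes keyed by the CSV headers.
SmarterCSV.process('data/import.csv', chunk_size: 500) do |chunk|
  RecordModel.import(chunk.map { |attrs| RecordModel.new(attrs) })
end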
Upvotes: 1