dewastator

Reputation: 213

Processing a large number of CSV files

I'll try to expand on the title of my question. I'm working on a Ruby project in which I have to process a large amount of data (around 120,000) stored in CSV files. I have to read this data, process it, and put it into a DB. Right now it takes a couple of days, and I have to make it much faster. The problem is that I sometimes get anomalies during processing and have to repeat the whole import. I've decided that improving performance is more important than hunting for the bug with a small amount of data. For now I'm sticking with CSV files. I decided to benchmark the processing script to find the bottlenecks and improve loading data from CSV. I see the following steps:

  1. Benchmark and fix the most problematic bottlenecks (a rough benchmarking sketch follows this list).
  2. Maybe split loading from CSV and processing. For example, create a separate table and load the data there; in the next step, read this data, process it, and put it into the right table.
  3. Introduce threads to load data from CSV.
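
A minimal sketch of what I mean by step 1, using Ruby's built-in Benchmark module ('sample.csv' and process_row are just placeholders for my real file and row handling):

require 'benchmark'
require 'csv'

# Placeholder for the real per-row processing.
def process_row(row)
  row.to_h
end

# Time the raw CSV parsing separately from parsing plus processing,
# to see whether the parser or my own processing is the bottleneck.
Benchmark.bm(18) do |x|
  x.report('parse only')      { CSV.foreach('sample.csv', headers: true) { |_row| } }
  x.report('parse + process') { CSV.foreach('sample.csv', headers: true) { |row| process_row(row) } }
end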

For now I use the standard Ruby CSV library. Do you recommend a better gem?

If any of you are familiar with a similar problem, I would be happy to hear your opinion.

Edit:

Database: PostgreSQL

System: Linux

Upvotes: 3

Views: 811

Answers (3)

Mark Setchell

Reputation: 207465

Make sure your Ruby script can process multiple files given as parameters, i.e. that you can run it like this:

script.rb abc.csv def.csv xyz.csv
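
A minimal sketch of such a script might look like this (process_row is just a placeholder for your own import logic):

#!/usr/bin/env ruby
require 'csv'

# Placeholder for the real per-row processing / database insert.
def process_row(row)
  puts row.inspect
end

# Treat every command-line argument as a CSV file to process.
ARGV.each do |path|
  CSV.foreach(path, headers: true) do |row|
    process_row(row)
  end
end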

Then parallelise it using GNU Parallel like this to keep all your CPU cores busy:

find . -name \*.csv -print0 | parallel -0 -X script.rb

The -X passes as many CSV files as possible to your Ruby job without exceeding the maximum command-line length. You can add -j 8 after parallel if you want GNU Parallel to run, say, 8 jobs at a time, and you can use --eta to get an estimated finish time:

find . -name \*.csv -print0 | parallel -0 -X -j 8 --eta script.rb

By default, GNU Parallel will run as many jobs in parallel as you have CPU cores.

Upvotes: 2

peter

Reputation: 42192

I haven't had the opportunity to test it myself, but recently I came across this article, which seems to do the job:

https://infinum.co/the-capsized-eight/articles/how-to-efficiently-process-large-excel-files-using-ruby

You'll have to adapt it to use CSV instead of XLSX. For future reference, in case the site ever goes down, here is the code. It works by writing BATCH_IMPORT_SIZE records to the database at a time, which should give a huge performance gain.

class ExcelDataParser
  def initialize(file_path)
    @file_path = file_path
    @records = []
    @counter = 0 # incremented once per row, so it equals the number of rows processed
  end

  BATCH_IMPORT_SIZE = 1000

  def call
    rows.each do |row|
      increment_counter
      records << build_new_record(row)
      import_records if reached_batch_import_size? || reached_end_of_file?
    end
  end

  private

  attr_reader :file_path, :records
  attr_accessor :counter

  def book
    @book ||= Creek::Book.new(file_path)
  end

  # in this example, we assume that the
  # content is in the first Excel sheet
  def rows
    @rows ||= book.sheets.first.rows
  end

  def increment_counter
    self.counter += 1
  end

  def row_count
    @row_count ||= rows.count
  end

  def build_new_record(row)
    # only build a new record without saving it
    RecordModel.new(...)
  end

  def import_records
    # save multiple records using activerecord-import gem
    RecordModel.import(records)

    # clear records array
    records.clear
  end

  def reached_batch_import_size?
    (counter % BATCH_IMPORT_SIZE).zero?
  end

  def reached_end_of_file?
    counter == row_count
  end
end
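
For plain CSV files the same batching idea could look roughly like this (my own adaptation sketch, not code from the article; 'data.csv' and RecordModel are placeholders):

require 'csv'

BATCH_IMPORT_SIZE = 1000
records = []

# Stream the CSV row by row and flush a batch every BATCH_IMPORT_SIZE rows.
# RecordModel and its bulk import (activerecord-import) are the same
# placeholders as in the class above.
CSV.foreach('data.csv', headers: true) do |row|
  records << RecordModel.new(row.to_h)
  if records.size >= BATCH_IMPORT_SIZE
    RecordModel.import(records)
    records.clear
  end
end

# Don't forget whatever is left after the last full batch.
RecordModel.import(records) unless records.empty?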

Upvotes: 3

Maxim Fedotov

Reputation: 1357

There are a few ways of going about it. I personally would recommend SmarterCSV, which makes it much faster and easier to process CSVs as an array of hashes. You should definitely split up the work if possible, perhaps by making a queue of files to process and working through it in batches with the use of Redis.
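
For illustration, a rough sketch of chunked processing with SmarterCSV ('data.csv' is a placeholder; the real per-chunk work would be your database insert):

require 'smarter_csv'

# Read the CSV in chunks of 500 rows; each chunk arrives as an array of
# hashes keyed by the symbolized header names.
SmarterCSV.process('data.csv', chunk_size: 500) do |chunk|
  # Placeholder for the real work, e.g. one multi-row INSERT into
  # PostgreSQL (activerecord-import, or COPY via the pg gem).
  puts "processing #{chunk.size} rows"
end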

Upvotes: 1
