Hartator

Reputation: 5145

Reading and Writing the same CSV file in Ruby

I have some processing to do involving a third party API, and I was planning to use a CSV file as a backlog of things to do.

Example

Task to do   Resulting file
#1           data/1.json
#2           data/2.json
#3           

So, #1 and #2 are already done. I want to work on #3, and save the CSV file as soon as data/3.json is completed.

As the tasks are unstable and error prone, I want to save progress to the CSV file after each task.

I've written this script in Ruby. It works well, but since the tasks are numerous (> 100k), it writes a couple of megabytes to disk each time a task is processed, rewriting the whole file every time. That seems like a good way to kill my HD:

class CSVResolver

  require 'csv'
  require 'json'

  attr_accessor :csv_path

  def initialize csv_path:
    self.csv_path = csv_path
  end

  def resolve
    csv = CSV.read(csv_path)
    csv.each_with_index do |row, index|
      next if row[1] # Skip tasks that already have a JSON result
      json = very_expensive_task_and_error_prone
      row[1] = "data/#{index}.json"
      File.write row[1], JSON.pretty_generate(json)
      csv[index] = row
      # Rewrite the entire CSV so progress is saved after every task
      CSV.open(csv_path, "wb") do |out_csv|
        csv.each do |csv_row|
          out_csv << csv_row
        end
      end
      resolve
    end
  end

end

Is there any way to improve on this, like making the write to CSV file atomic?
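
For reference, the kind of atomicity I have in mind is writing to a temporary file and renaming it over the original, roughly like this (write_csv_atomically is just a name for the sketch, not something I already have):

require 'csv'
require 'fileutils'

# Sketch: write the whole CSV to a temporary file next to the original,
# then rename it into place. The rename is atomic on POSIX filesystems,
# so a crash mid-write can't leave a truncated backlog behind.
def write_csv_atomically(csv_path, rows)
  tmp_path = "#{csv_path}.tmp"
  CSV.open(tmp_path, "wb") do |out_csv|
    rows.each { |row| out_csv << row }
  end
  FileUtils.mv(tmp_path, csv_path)
end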

Upvotes: 0

Views: 1423

Answers (2)

mahemoff

Reputation: 46409

I'd use an embedded database for this purpose, such as SQLite or LevelDB.

Unlike a regular database, you'll still get many of the benefits of a CSV file, i.e. it can be stored in a single file/folder without any server or permission hassle. At the same time, you'll get better I/O characteristics than reading and writing a monolithic file on each update: the library should be smart enough to index records, minimise changes, and keep things in memory while buffering output.
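
As a rough sketch of what that could look like with the sqlite3 gem, reusing the very_expensive_task_and_error_prone method from your question (backlog.db and the tasks table are just illustrative names):

require 'sqlite3'
require 'json'

db = SQLite3::Database.new('backlog.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS tasks (
    id          INTEGER PRIMARY KEY,
    result_path TEXT
  )
SQL

# Fetch the pending ids first, then update one row at a time; each update
# touches only that row instead of rewriting the whole dataset.
pending = db.execute('SELECT id FROM tasks WHERE result_path IS NULL').flatten
pending.each do |id|
  json = very_expensive_task_and_error_prone
  path = "data/#{id}.json"
  File.write(path, JSON.pretty_generate(json))
  db.execute('UPDATE tasks SET result_path = ? WHERE id = ?', [path, id])
end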

Upvotes: 3

lacostenycoder

Reputation: 11216

For data persistence you would, in most cases, be best served by selecting a tool designed for the job: a database. You've already named enough of a reason not to use the hand-spun CSV design, as it is memory inefficient and poses more problems than it likely solves. Also, depending on the amount of data you need to process via the 3rd party API, you may want to run multi-threaded processes, where reading/writing to a single file won't work.

You might want to check out https://github.com/jeremyevans/sequel
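
A rough sketch of the same backlog with Sequel on top of SQLite, again reusing the very_expensive_task_and_error_prone method from your question (file and table names are just placeholders):

require 'sequel'
require 'json'

DB = Sequel.sqlite('backlog.db')

DB.create_table? :tasks do
  primary_key :id
  String :result_path
end

tasks = DB[:tasks]

# Only the row that changed is written on each update, and SQLite's locking
# keeps the file consistent even if several workers run at once.
tasks.where(result_path: nil).select_map(:id).each do |id|
  json = very_expensive_task_and_error_prone
  path = "data/#{id}.json"
  File.write(path, JSON.pretty_generate(json))
  tasks.where(id: id).update(result_path: path)
end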

Upvotes: 1
