Reputation: 5145
I have some processing to do involving a third-party API, and I was planning to use a CSV file as a backlog of things to do.
Example:

Task to do    Resulting file
#1            data/1.json
#2            data/2.json
#3
So, #1 and #2 are already done. I want to work on #3, and save the CSV file as soon as data/3.json is completed.
As the task is unstable and error prone, I want to save progress after each task in the CSV file.
I've written this script in Ruby. It works well, but since the tasks are numerous (> 100k), it writes a couple of megabytes to disk each time a task is processed: the whole file. That seems like a good way to kill my HD:
class CSVResolver
  require 'csv'
  require 'json'

  attr_accessor :csv_path

  def initialize(csv_path:)
    self.csv_path = csv_path
  end

  def resolve
    csv = CSV.read(csv_path)
    csv.each_with_index do |row, index|
      next if row[1] # Skip tasks that already have a JSON result
      json = very_expensive_task_and_error_prone
      row[1] = "data/#{index}.json"
      File.write(row[1], JSON.pretty_generate(json))
      csv[index] = row
      # Rewrite the whole CSV so progress survives a crash
      CSV.open(csv_path, "wb") do |out|
        csv.each { |r| out << r }
      end
      # Restart from a fresh read of the file; return so the now-stale
      # outer loop doesn't keep iterating over old data
      return resolve
    end
  end
end
Is there any way to improve on this, like making the write to the CSV file atomic?
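By "atomic" I mean something like writing the new contents to a temporary file and renaming it over the original, so a crash mid-write can't leave a half-written backlog. A minimal sketch of what I have in mind (untested):

require 'csv'

# Write the whole CSV to a temporary file next to the original, then
# rename it into place. rename(2) is atomic on POSIX filesystems, so
# the file on disk is always either the old version or the new one.
def atomic_csv_write(csv_path, rows)
  tmp_path = "#{csv_path}.tmp"
  CSV.open(tmp_path, "wb") do |out|
    rows.each { |row| out << row }
  end
  File.rename(tmp_path, csv_path)
end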
Upvotes: 0
Views: 1423
Reputation: 46409
I'd use an embedded database for this purpose, such as SQLite or LevelDB.
Unlike a regular database, you'll still get many of the benefits of a CSV file, i.e. it can be stored in a single file/folder without any server or permissioning hassle. At the same time, you'll get better I/O characteristics than reading and writing a monolithic file on each update: the library is smart enough to index records, minimise changes, and buffer output in memory.
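For instance, with the sqlite3 gem the backlog could look something like this. A rough sketch: the database and table names are illustrative, and very_expensive_task_and_error_prone stands in for the method from the question.

require 'sqlite3'
require 'json'

db = SQLite3::Database.new("backlog.db")
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS tasks (
    id     INTEGER PRIMARY KEY,
    result TEXT            -- path of the JSON file, NULL until done
  )
SQL

# Fetch the pending ids first, then update one row per finished task;
# each UPDATE is a small journalled write, not a full-file rewrite.
pending = db.execute("SELECT id FROM tasks WHERE result IS NULL").flatten
pending.each do |id|
  json = very_expensive_task_and_error_prone
  path = "data/#{id}.json"
  File.write(path, JSON.pretty_generate(json))
  db.execute("UPDATE tasks SET result = ? WHERE id = ?", [path, id])
end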
Upvotes: 3
Reputation: 11216
For data persistence you would, in most cases, be best served by a tool designed for the job: a database. You've already named reason enough not to use the hand-rolled CSV design, as it is memory-inefficient and poses more problems than it likely solves. Also, depending on the amount of data you need to process via the third-party API, you may want multi-threaded processing, where reading and writing a single file won't work.
You might want to check out https://github.com/jeremyevans/sequel
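A rough sketch of the same backlog idea with Sequel on top of SQLite (the table and column names are just illustrative, and very_expensive_task_and_error_prone is the method from the question):

require 'sequel'
require 'json'

DB = Sequel.sqlite("backlog.db")
DB.create_table? :tasks do
  primary_key :id
  String :result         # path of the JSON file, nil until done
end

tasks = DB[:tasks]
tasks.where(result: nil).select_map(:id).each do |id|
  json = very_expensive_task_and_error_prone
  path = "data/#{id}.json"
  File.write(path, JSON.pretty_generate(json))
  tasks.where(id: id).update(result: path)   # one small write per task
end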
Upvotes: 1