Reputation: 376
I'm trying to read a 5MM-line file, and right now it is exceeding my allotted memory usage on Heroku. My method is somewhat fast (~200 inserts/second), and I believe it's crashing on the import, so my plan was to import in batches of 1,000 or 10,000. My question is: how do I tell that I'm at the end of my file? Ruby has an .eof? method, but it's a File method, and I'm not sure how to call it in my loop.
def self.import_parts_db(file)
  time = Benchmark.measure do
    Part.transaction do
      parts_db = []
      CSV.parse(File.read(file), headers: true) do |row|
        row_hash = row.to_hash
        part = Part.new(
          part_num: row_hash["part_num"],
          description: row_hash["description"],
          manufacturer: row_hash["manufacturer"],
          model: row_hash["model"],
          cage_code: row_hash["cage_code"],
          nsn: row_hash["nsn"]
        )
        parts_db << part
      end
      Part.import parts_db
    end
  end
  puts time
end
Upvotes: 4
Views: 4385
Reputation: 1
Try re-saving the CSV. I had one that was about 30 MB and it ate the remaining 8 GB of RAM; re-saving the file seemed to correct my issue.
Upvotes: -3
Reputation: 54223
As soon as you use File.read(file) with a huge file, your script will use a lot of memory (possibly too much): you read the entire file into one huge string, even though CSV only processes it line by line.
It might work fine for files with thousands of rows, but you should still use CSV.foreach. Change
CSV.parse(File.read(file), headers: true) do |row|
to
CSV.foreach(file, headers: true) do |row|
In this example, memory usage went from 1GB to 0.5MB.
parts_db becomes a huge Array of Parts, which keeps growing until the very end of the CSV file.
You need to either remove the transaction (import will be slow but won't require more memory than for 1 row) or process the CSV in batches.
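For the first option, a minimal sketch could look like this (assuming the same Part model and columns as in the question; each row is saved with create! as soon as it is parsed, so only one row is held in memory at a time):
  def self.import_parts_db(file)
    time = Benchmark.measure do
      # No transaction and no accumulated array: each row is persisted immediately.
      CSV.foreach(file, headers: true) do |row|
        Part.create!(
          part_num: row['part_num'],
          description: row['description'],
          manufacturer: row['manufacturer'],
          model: row['model'],
          cage_code: row['cage_code'],
          nsn: row['nsn']
        )
      end
    end
    puts time
  end
Without a wrapping transaction, every create! commits on its own, which is part of why this variant is the slowest.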
Here's one possibility for the batch approach. We use CSV.parse again, but only on batches of 2,000 lines:
def self.import_parts_db(filename)
  time = Benchmark.measure do
    File.open(filename) do |file|
      headers = file.first
      file.lazy.each_slice(2000) do |lines|
        Part.transaction do
          rows = CSV.parse(lines.join, write_headers: true, headers: headers)
          parts_db = rows.map do |row|
            Part.new(
              part_num: row['part_num'],
              description: row['description'],
              manufacturer: row['manufacturer'],
              model: row['model'],
              cage_code: row['cage_code'],
              nsn: row['nsn']
            )
          end
          Part.import parts_db
        end
      end
    end
  end
  puts time
end
The code above shouldn't use much memory, but it could still take a long time to import everything, possibly too long for a remote server.
The advantage of using an Enumerator is that it's easy to skip batches and get just the ones you want.
Let's say your import takes too long and stops for some reason after 424,000 successful imports.
You could replace:
file.lazy.each_slice(2000) do |lines|
with
file.lazy.drop(424_000).take(300_000).each_slice(2000) do |lines|
to skip the first 424,000 CSV lines and parse the next 300,000 of them.
For the next import, use:
file.lazy.drop(424_000+300_000).take(300_000).each_slice(2000) do |lines|
and then:
file.lazy.drop(424_000+2*300_000).take(300_000).each_slice(2000) do |lines|
...
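If you expect to resume like this several times, a small wrapper along these lines might be convenient. This is only a sketch: import_parts_db_slice is a made-up name, and the Part columns are assumed to match the question.
  # Hypothetical helper: import `limit` CSV lines starting after `offset` lines,
  # in transactions of `batch_size` rows.
  def self.import_parts_db_slice(filename, offset:, limit:, batch_size: 2000)
    File.open(filename) do |file|
      headers = file.first
      file.lazy.drop(offset).take(limit).each_slice(batch_size) do |lines|
        Part.transaction do
          rows = CSV.parse(lines.join, headers: headers)
          Part.import rows.map { |row|
            Part.new(
              part_num: row['part_num'],
              description: row['description'],
              manufacturer: row['manufacturer'],
              model: row['model'],
              cage_code: row['cage_code'],
              nsn: row['nsn']
            )
          }
        end
      end
    end
  end

  # First run:  Part.import_parts_db_slice('parts.csv', offset: 424_000, limit: 300_000)
  # Next run:   Part.import_parts_db_slice('parts.csv', offset: 424_000 + 300_000, limit: 300_000)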
Upvotes: 10
Reputation: 9226
CSV.parse is pretty efficient, passing one parsed CSV line at a time to the block that does the processing.
The problem doesn't come from the CSV parser; it comes from building the parts_db array in memory. I suggest rewriting the Part.import method to import the data line by line, instead of an entire array of records at once.
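A minimal sketch of that idea, assuming Part is an ActiveRecord model and that a plain create! per row is acceptable in place of the bulk Part.import call (Hash#slice needs Ruby 2.5+; the column names are taken from the question):
  CSV.foreach(file, headers: true) do |row|
    # One record is built and persisted per CSV line, so memory use stays flat.
    Part.create!(row.to_hash.slice('part_num', 'description', 'manufacturer',
                                   'model', 'cage_code', 'nsn'))
  end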
Upvotes: 0