Reputation: 35
I am trying to import a large text file (approximately 2 million rows of numbers at 260MB) into an array, make edits to the array, and then write the results to a new text file, by writing:
file_data = File.readlines("massive_file.txt")
file_data = file_data.map!(&:strip)
file_data.each do |s|
s.gsub!(/,.*\z/, "")
end
File.open("smaller_file.txt", 'w') do |f|
f.write(file_data.map(&:strip).uniq.join("\n"))
end
However, I received the error: failed to allocate memory (NoMemoryError). How can I allocate more memory to complete the task? Or, ideally, is there another method I can use that avoids having to re-allocate memory?
Upvotes: 2
Views: 3116
Reputation: 2733
You can read the file line by line:
require 'set'
require 'digest/md5'
file_data = File.new('massive_file.txt', 'r')
file_output = File.new('smaller_file.txt', 'w')
unique_lines_set = Set.new
while (line = file_data.gets)
  line.strip!
  line.gsub!(/,.*\z/, "")
  # Check if the line is unique
  line_hash = Digest::MD5.hexdigest(line)
  unless unique_lines_set.include?(line_hash)
    # It is unique so add its hash to the set
    unique_lines_set.add(line_hash)
    # Write the line in the output file
    file_output.puts(line)
  end
end
file_data.close
file_output.close
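A small note on the set's memory use, as a sketch rather than a measured result: Digest::MD5.hexdigest returns a 32-character hex string, while Digest::MD5.digest returns the same 128-bit hash as 16 raw bytes. The helper name line_key below is hypothetical and only illustrates a more compact key for the set.

```ruby
require 'digest/md5'

# Hypothetical helper: a compact set key for a line.
# The raw MD5 digest is 16 bytes; the hex form is 32 characters.
def line_key(line)
  Digest::MD5.digest(line)
end

puts Digest::MD5.hexdigest("12345").bytesize # 32
puts line_key("12345").bytesize              # 16
```

If the cleaned lines are themselves shorter than the digest, storing the lines directly in the set may use less memory than storing any hash of them.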
Upvotes: 2
Reputation: 16506
Alternatively, you can read the file in chunks, which may be faster than reading it line by line:
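FILENAME = "massive_file.txt"
MEGABYTE = 1024 * 1024
class File
  # Yield the file's contents in pieces of at most chunk_size bytes
  def each_chunk(chunk_size = MEGABYTE) # or n * MEGABYTE
    yield read(chunk_size) until eof?
  end
end
leftover = "" # partial line carried over from the previous chunk
File.open("smaller_file.txt", "w") do |out|
  File.open(FILENAME, "rb") do |f|
    f.each_chunk do |chunk|
      # A chunk can end mid-line, so split and keep the last piece for later
      lines = (leftover + chunk).split("\n", -1)
      leftover = lines.pop || ""
      # Write each complete line right away instead of accumulating the file
      lines.each { |line| out.puts line.strip.gsub(/,.*\z/, "") }
    end
  end
  # Flush the final line if the file did not end with a newline
  out.puts leftover.strip.gsub(/,.*\z/, "") unless leftover.empty?
end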
ref: https://stackoverflow.com/a/1682400/3035830
Upvotes: -1
Reputation: 7779
You can try reading and writing one line at a time:
new_file = File.open('smaller_file.txt', 'w')
File.open('massive_file.txt', 'r') do |file|
  file.each_line do |line|
    new_file.puts line.strip.gsub(/,.*\z/, "")
  end
end
new_file.close
The only thing left to do is to find and skip the duplicated lines.
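One possible way to handle the duplicates while still streaming, as a minimal sketch: remember each cleaned line in a Set and write it only the first time it appears. The method name write_unique_lines is hypothetical, and the sketch assumes the set of distinct cleaned lines fits in memory.

```ruby
require 'set'

# Hypothetical helper: stream src line by line, drop everything from the
# first comma onward, and write each distinct result to dst once.
# Returns the number of unique lines written.
def write_unique_lines(src, dst)
  seen = Set.new
  File.open(dst, 'w') do |out|
    File.foreach(src) do |line|
      cleaned = line.strip.sub(/,.*\z/, '')
      # Set#add? returns nil when the element is already present
      out.puts(cleaned) if seen.add?(cleaned)
    end
  end
  seen.size
end
```

It could be called as write_unique_lines('massive_file.txt', 'smaller_file.txt'). Only one input line is held at a time, so memory grows with the number of unique lines rather than with the size of the file.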
Upvotes: 0