Matt K

Reputation: 35

'Failed to allocate memory' error with large array

I am trying to import a large text file (approximately 2 million rows of numbers at 260MB) into an array, make edits to the array, and then write the results to a new text file, by writing:

file_data = File.readlines("massive_file.txt")
file_data = file_data.map!(&:strip)
file_data.each do |s|
    s.gsub!(/,.*\z/, "")
end
File.open("smaller_file.txt", 'w') do |f|
    f.write(file_data.map(&:strip).uniq.join("\n"))
end

However, I get the error "failed to allocate memory (NoMemoryError)". How can I allocate more memory to complete the task? Or, ideally, is there another method I can use that avoids having to re-allocate memory?

Upvotes: 2

Views: 3116

Answers (3)

Plamen Nikolov

Reputation: 2733

You can read the file line by line:

require 'set'
require 'digest/md5'
file_data = File.new('massive_file.txt', 'r')
file_output = File.new('smaller_file.txt', 'w')
unique_lines_set = Set.new

while (line = file_data.gets)
    line.strip!
    line.gsub!(/,.*\z/, "")
    # Check if the line is unique
    line_hash = Digest::MD5.hexdigest(line)
    if not unique_lines_set.include? line_hash
      # It is unique so add its hash to the set
      unique_lines_set.add(line_hash)

      # Write the line in the output file
      file_output.puts(line)
    end
end

file_data.close
file_output.close
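
A possible refinement of the same idea, offered as a sketch rather than a drop-in replacement: hexdigest returns a 32-character string per unique line, whereas Digest::MD5.digest returns the raw 16-byte digest, so the set can hold smaller keys. Set#add? also combines the membership check and the insert. The helper names dedupe_key and each_unique are made up for illustration.

```ruby
require 'set'
require 'digest/md5'

# Hypothetical helper: a compact 16-byte key for a line (raw MD5 digest).
def dedupe_key(line)
  Digest::MD5.digest(line)
end

# Hypothetical helper: yields each element of `lines` the first time its
# key is seen and skips later repeats.
def each_unique(lines)
  seen = Set.new
  lines.each do |line|
    # Set#add? returns nil when the key is already in the set
    yield line if seen.add?(dedupe_key(line))
  end
end
```

Any enumerable of already-cleaned lines can be passed to each_unique. If the cleaned lines are themselves shorter than 16 bytes, storing the lines directly in the set is likely cheaper than storing digests.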

Upvotes: 2

shivam

Reputation: 16506

Alternatively, you can read the file in chunks, which should be faster than reading it line by line:

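FILENAME = "massive_file.txt"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE) # or n * MEGABYTE
    yield read(chunk_size) until eof?
  end
end

# Write each cleaned line out immediately instead of accumulating the whole
# file in one string, which would run out of memory again.
File.open("smaller_file.txt", "w") do |out|
  File.open(FILENAME, "rb") do |f|
    leftover = "" # partial line carried over from the previous chunk
    f.each_chunk do |chunk|
      lines = (leftover + chunk).split("\n", -1)
      leftover = lines.pop || "" # the last piece may be an incomplete line
      lines.each { |line| out.puts(line.strip.sub(/,.*\z/, "")) }
    end
    out.puts(leftover.strip.sub(/,.*\z/, "")) unless leftover.empty?
  end
end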

ref: https://stackoverflow.com/a/1682400/3035830

Upvotes: -1

Aguardientico

Reputation: 7779

You can try reading and writing one line at a time:

new_file = File.open('smaller_file.txt', 'w')
File.open('massive_file.txt', 'r') do |file|
  file.each_line do |line|
    new_file.puts line.strip.gsub(/,.*\z/, "")
  end
end
new_file.close

The only thing left to do is detect duplicate lines.
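
One way the duplicate check could be added, as a minimal sketch: keep a Set of the cleaned lines already written and skip any line that is already in it. The method name strip_and_dedupe is hypothetical, and the sketch assumes the set of unique cleaned lines fits in memory even though the whole file does not.

```ruby
require 'set'

# Hypothetical helper: copies src to dst one line at a time, dropping
# everything from the first comma onward and skipping lines that were
# already written.
def strip_and_dedupe(src, dst)
  seen = Set.new
  File.open(dst, 'w') do |out|
    File.foreach(src) do |line|
      cleaned = line.strip.sub(/,.*\z/, "")
      # Set#add? returns nil if cleaned is already in the set
      out.puts(cleaned) if seen.add?(cleaned)
    end
  end
end
```

For the files in the question, it would be called as strip_and_dedupe('massive_file.txt', 'smaller_file.txt'). Only one input line plus the set of unique cleaned lines is held in memory at any time.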

Upvotes: 0
