Nick

Reputation:

How can I handle large files in Ruby?

I'm pretty new to programming, so be gentle. I'm trying to extract ISBN numbers from a library database .dat file. I have written code that works, but it only searches through about half of the 180 MB file. How can I adjust it to search the whole file? Or how can I write a program that will split the .dat file into manageable chunks?

edit: Here's my code:

export = File.new("resultsfinal.txt","w+")

File.open("bibrec2.dat").each do |line|
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end

Upvotes: 6

Views: 7756

Answers (6)

Igbanam

Reputation: 6082

You can look into using File#truncate and IO#seek and employ a binary-search-style algorithm. #truncate is destructive, so you should duplicate the file first (I know this is a hassle).

require "fileutils"

middle = File.size("my_huge_file.dat") / 2
# truncate is destructive, so work on a duplicate of the file
FileUtils.cp("my_huge_file.dat", "first_half.dat")
File.open("first_half.dat", "r+") { |f| f.truncate(middle) }
# run the search algorithm on 'first_half.dat'
File.open("my_huge_file.dat") do |huge_file|
  huge_file.seek(middle)  # the second half starts at byte `middle`
  # run the search algorithm from here
end

The code is highly untested, brittle and incomplete. But I hope it gives you a platform to build off of.
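Since #truncate destroys data, a non-destructive alternative is to walk the file in fixed-size chunks with IO#read and IO#seek. A rough sketch (the chunk size, overlap, and temp-file stand-in below are illustrative assumptions, not part of the original answer):

```ruby
require "tempfile"

CHUNK_SIZE = 16  # tiny for the sketch; something like 1 << 20 is more realistic
OVERLAP    = 4   # re-read a few bytes so a match spanning a chunk boundary isn't lost

# small temp file as a stand-in for my_huge_file.dat
file = Tempfile.new("huge")
file.write("x" * 40)
file.rewind

chunks = []
File.open(file.path, "rb") do |f|
  until f.eof?
    chunks << f.read(CHUNK_SIZE)
    # back up slightly so the next chunk overlaps the end of this one
    f.seek(-OVERLAP, IO::SEEK_CUR) unless f.eof?
  end
end
file.close!
# `chunks` now covers the whole file, with OVERLAP bytes of redundancy between chunks
```

The overlap matters for this question: without it, an ISBN straddling a chunk boundary would be missed.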

Upvotes: 1

pguardiario

Reputation: 55002

The main thing is to clean up and combine the regexes for a performance benefit. Also, you should always use block syntax with files to ensure the file descriptors are closed properly. File#each doesn't load the whole file into memory; it reads one line at a time:

File.open("resultsfinal.txt", "w+") do |output|
  File.foreach("bibrec2.dat") do |line|
    output.puts line.scan(/a[\dxX]{10}(?:[\dxX]{3}|\W)/)
  end
end
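As a quick sanity check that the combined pattern still matches both shapes the question's two regexes targeted (the sample string below is invented for illustration):

```ruby
combined = /a[\dxX]{10}(?:[\dxX]{3}|\W)/

sample  = "junk a0123456789 junk a0123456789012 junk"
matches = sample.scan(combined)
# picks up both the 10-character form (with trailing non-word char)
# and the 13-character form
```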

Upvotes: 3

Stevenr12

Reputation: 108

export = File.new("resultsfinal.txt", "w+")
file = File.new("bibrec2.dat", "r")
while (line = file.gets)
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
file.close
export.close

Upvotes: 2

Yoann Le Touche

Reputation: 1300

You should try catching exceptions to check whether the problem is really in the read block or not.

Just so you know, I have already written a script with much the same structure to search a really big file (~8 GB) without any problem.

export = File.new("resultsfinal.txt", "w+")

File.open("bibrec2.dat").each do |line|
  begin
    line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
      export.puts x
    end
    line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
      export.puts x
    end
  rescue => e
    puts "Problem while adding the result: #{e.message}"
  end
end
export.close

Upvotes: 4

Mike Woodhouse

Reputation: 52326

As to the performance issue, I can't see anything particularly worrying about the file size: 180MB shouldn't pose any problems. What happens to memory use when you're running your script?

I'm not sure, however, that your regular expressions are doing what you want. This, for example:

/[a]{1}[1234567890xX]{10}\W/

does (I think) this:

  • exactly one "a". Do you really mean to match a literal "a"? If so, plain "a" suffices; "[a]{1}" is redundant.
  • exactly 10 of (digit or "x" or "X")
  • a single "non-word" character i.e. not a-z, A-Z, 0-9 or underscore
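That breakdown is easy to verify in irb (the sample string is made up for illustration):

```ruby
pattern = /[a]{1}[1234567890xX]{10}\W/

sample = "noise a0123456789 noise a12345 noise"
# only the first candidate has an "a", ten digit/x/X characters,
# and a trailing non-word character; "a12345" is too short
sample.scan(pattern)
```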

There are a couple of sample ISBN matchers here and here, although they seem to be matching something more like the format that we see on the back cover of a book and I'm guessing your input file has stripped out some of that formatting.

Upvotes: 1

drudru

Reputation: 5023

If you are programming on a modern operating system and the computer has enough memory (say, 512 MB), Ruby should have no problem reading the entire 180 MB file into memory.

Things typically get iffy when your working set approaches 2 GB on a typical 32-bit OS.
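Slurping the whole file and scanning it in one pass looks like this (a temp file stands in for the 180 MB bibrec2.dat; whether this is safe depends on available memory, as noted above):

```ruby
require "tempfile"

# small stand-in for bibrec2.dat
input = Tempfile.new("bibrec")
input.write("junk a0123456789 junk\n")
input.rewind

data    = File.read(input.path)   # reads the entire file into one String
matches = data.scan(/a[\dxX]{10}\W/)
input.close!
```

One advantage over line-by-line reading: a slurp would also catch ISBNs even if the file's line endings were unusual, since no line splitting is involved.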

Upvotes: -2
