Reputation:
I'm pretty new to programming, so be gentle. I'm trying to extract ISBN numbers from a library database .dat file. I have written code that works, but it only searches through about half of the 180MB file. How can I adjust it to search the whole file? Or how can I write a program that will split the .dat file into manageable chunks?
edit: Here's my code:
export = File.new("resultsfinal.txt","w+")
File.open("bibrec2.dat").each do |line|
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
Upvotes: 6
Views: 7756
Reputation: 6082
You can look into using File#truncate and IO#seek and employ a binary-search-style approach. File#truncate is destructive (it shortens the file on disk), so you should work on a copy of the file (I know this is a hassle).
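require "fileutils"

# Work on a copy, because File.truncate shortens the file on disk.
FileUtils.cp("my_huge_file.dat", "first_half.dat")
middle = File.size("my_huge_file.dat") / 2
File.truncate("first_half.dat", middle)
# run search algorithm on 'first_half.dat'
File.open("my_huge_file.dat") do |huge_file|
  # Start at the first byte after the truncated copy; the line that
  # straddles 'middle' is split between the two halves.
  huge_file.seek(middle)
  # run search algorithm from here
end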
The code is highly untested, brittle and incomplete. But I hope it gives you a platform to build off of.
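A non-destructive variation on the same idea, as a rough sketch only: instead of truncating a copy, seek to a byte offset and re-align on the next line boundary so that no line is cut in half. `each_line_in_range` is a hypothetical helper name, not a Ruby built-in, and the sketch assumes the file is newline-delimited.

```ruby
# Hypothetical helper: yields every line whose first byte lies in the
# byte range [start, stop) of the file at 'path'.
def each_line_in_range(path, start, stop)
  File.open(path, "rb") do |file|
    if start > 0
      # Step back one byte and discard up to the next newline, so that
      # reading resumes at the beginning of a full line.
      file.seek(start - 1)
      file.gets
    end
    # A line that begins before 'stop' is read in full, even if it ends after it.
    while file.pos < stop && (line = file.gets)
      yield line
    end
  end
end
```

With `middle = File.size(path) / 2`, calling it once with `(path, 0, middle)` and once with `(path, middle, File.size(path))` should visit every line exactly once, so each half can be searched separately.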
Upvotes: 1
Reputation: 55002
The main thing is to clean up and combine the regex for performance benefits. Also, you should always use block syntax with files to ensure the file descriptors get closed properly. File#each doesn't load the whole file into memory; it reads one line at a time:
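File.open("resultsfinal.txt","w+") do |output|
  # File.foreach reads one line at a time and closes the input file when done.
  File.foreach("bibrec2.dat") do |line|
    # Write each match on its own line; lines without a match write nothing.
    line.scan(/a[\dxX]{10}(?:[\dxX]{3}|\W)/) { |isbn| output.puts isbn }
  end
end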
Upvotes: 3
Reputation: 108
export = File.new("resultsfinal.txt", "w+")
file = File.new("bibrec2.dat", "r")
while (line = file.gets)
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
file.close
export.close
Upvotes: 2
Reputation: 1300
You should try rescuing exceptions to check whether the problem is really in the read block or not.
Just so you know, I have already written a script with much the same syntax to search a really big file of ~8GB without problems.
export = File.new("resultsfinal.txt","w+")
File.open("bibrec2.dat").each do |line|
  begin
    line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
      export.puts x
    end
    line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
      export.puts x
    end
  rescue => e
    # Report what went wrong instead of hiding the error.
    puts "Problem while adding the result: #{e.message}"
  end
end
export.close
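Another way to check whether the read loop really stops early, as a minimal sketch: count what the loop consumes and compare it with the file size. `read_coverage` is a hypothetical helper name and the default mode "rb" is my own choice; in binary mode the two byte counts should be equal if the whole file was read.

```ruby
# Hypothetical diagnostic: reports how many lines and bytes a
# line-by-line loop actually consumed, next to the size on disk.
def read_coverage(path, mode = "rb")
  lines = 0
  bytes = 0
  File.open(path, mode) do |file|
    file.each_line do |line|
      lines += 1
      bytes += line.bytesize
    end
  end
  { lines: lines, bytes_read: bytes, file_size: File.size(path) }
end
```

Running it once with "r" and once with "rb" on bibrec2.dat would show whether the open mode changes how far the loop gets. A small difference in text mode can come from line-ending translation on some platforms; a large one means the loop stopped before the end of the file.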
Upvotes: 4
Reputation: 52326
As to the performance issue, I can't see anything particularly worrying about the file size: 180MB shouldn't pose any problems. What happens to memory use when you're running your script?
I'm not sure, however, that your Regular Expressions are doing what you want. This, for example:
/[a]{1}[1234567890xX]{10}\W/
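does (I think) this: it matches a single lowercase "a", then exactly ten characters that are each a digit or "x"/"X", then one non-word character (anything other than a letter, digit or underscore).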
There are a couple of sample ISBN matchers here and here, although they seem to be matching something more like the format that we see on the back cover of a book and I'm guessing your input file has stripped out some of that formatting.
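A quick way to see what the two patterns from the question actually match is to run them against a made-up line. The sample string and the `isbn_candidates` helper below are illustrative assumptions, not data or code from the real file.

```ruby
# The two patterns exactly as they appear in the question.
TEN_DIGIT = /[a]{1}[1234567890xX]{10}\W/
THIRTEEN_DIGIT = /[a]{1}[1234567890xX]{13}/

# Hypothetical helper: all matches of both patterns in one line.
def isbn_candidates(line)
  line.scan(TEN_DIGIT) + line.scan(THIRTEEN_DIGIT)
end

# Made-up sample line, purely for illustration.
sample = "a0306406152 a9780306406157 a123\n"
p isbn_candidates(sample)
# => ["a0306406152 ", "a9780306406157"]

# Without a trailing non-word character the ten-digit pattern does not match.
p isbn_candidates("a0306406152")
# => []
```

Note that the ten-digit match includes the leading "a" and the trailing non-word character, so the output probably needs trimming before the numbers are usable.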
Upvotes: 1
Reputation: 5023
If you are programming on a modern operating system and the computer has enough memory (say 512 MB), Ruby should have no problem reading the entire 180MB file into memory.
Things typically get iffy when you get to about a 2 GB working set on a typical 32-bit OS.
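A minimal sketch of the read-everything-at-once approach, assuming the file fits comfortably in memory. `isbn_candidates_in_memory` is a hypothetical helper name, the pattern is the combined one from another answer in this thread, and `File.binread` is my own choice so that raw bytes in the .dat file are not interpreted as text in any particular encoding.

```ruby
# Hypothetical helper: loads the whole file into one string and scans it once.
def isbn_candidates_in_memory(path)
  # File.binread returns the file's bytes as a single binary string.
  File.binread(path).scan(/a[\dxX]{10}(?:[\dxX]{3}|\W)/)
end
```

Something like `File.open("resultsfinal.txt", "w") { |out| out.puts isbn_candidates_in_memory("bibrec2.dat") }` would then write the matches, one per line.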
Upvotes: -2