Reputation: 4586
I need to scan a 300MB text file with a regex.
Is there any lazy method to do a full file scan with a regex without reading it into a separate variable?
UPD
Done. You can now use this function to read the file in chunks; adapt it to your own needs.
def prepare_session_hash(fname, regex, start = 0)
  # regex is expected to be a Regexp with two capture groups: key and value
  @session_login_hash = {}
  File.open(fname, 'rb') { |f|
    fsize = f.size
    bsize = fsize / 8             # read the file in roughly eight chunks
    if start > 0
      f.seek(start)
    end
    overlap = 200                 # bytes re-read so a match spanning two chunks is not lost
    while true
      if (f.tell >= overlap) and (f.tell < fsize)
        f.seek(f.tell - overlap)  # step back by the overlap before the next read
      end
      buffer = f.read(bsize)
      if buffer
        buffer.scan(regex) { |match|
          @session_login_hash[match[0]] = match[1]
        }
      else
        return @session_login_hash   # f.read returns nil at EOF
      end
    end
  }
end
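For illustration, here is a usage sketch; the file name, the log line format, and the two-capture-group regex are assumptions of mine, not from the original post:
# Hypothetical call: assumes records like "login=alice session=abc123" and a
# pattern with two capture groups, since the function stores match[0] => match[1].
sessions = prepare_session_hash('sessions.log', /login=(\w+)\s+session=(\w+)/)
puts sessions.size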
Upvotes: 5
Views: 2030
Reputation: 43245
Example:
This is string with multiline numbers -2000
2223434
34356666
444564646
. These numbers can occur at 34345
567567 places, and on 67
87878 pages . The problem is to find a good
way to extract these more than 100
0 regexes without memory hogging.
In this text, assume the desired pattern is numeric strings, e.g. /\d+/ to match runs of digits.
Then, instead of loading and processing the whole file, you can choose a chunk-delimiting pattern, say the FULL STOP '.' in this case,
and only read and process up to that pattern, then move on to the next chunk (a small sketch follows the chunk examples below).
CHUNK#1:
This is string with multiline numbers -2000
2223434
34356666
444564646
.
CHUNK#2:
These numbers can occur at 34345
567567 places, and on 67
87878 pages
and so on.
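A minimal sketch of this chunked reading, assuming '.' as the delimiter, /\d+/ as the pattern, and a made-up file name:
# Read one '.'-terminated chunk at a time instead of slurping the whole file.
numbers = []
File.open('big.txt', 'r') { |f|
  # IO#gets accepts a custom separator, so each read stops at the next full stop.
  while (chunk = f.gets('.'))
    numbers.concat(chunk.scan(/\d+/))   # collect every digit run in this chunk
  end
}
puts numbers.length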
EDIT: Adding @Ranty's suggestion from the comments as well:
Or simply read some number of lines at a time, say 20. When you find a match within them, clear the buffer up to the end of the match and append another 20 lines. There is no need to figure out a frequently occurring delimiter.
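A rough sketch of that line-window idea, assuming 20-line slices, the /\d+/ pattern from above, and a made-up file name:
# Keep a rolling text window; after scanning it, drop everything up to the last match.
window = ''
results = []
File.foreach('big.txt').each_slice(20) { |lines|
  window << lines.join
  last_end = 0
  window.scan(/\d+/) { |m|
    results << m
    last_end = Regexp.last_match.end(0)   # remember where the last match ended
  }
  window = window[last_end..-1]           # clear up to the end of the last match
}
puts results.length
This keeps memory bounded, though a match sitting right at the end of the window could be split across slices, which the overlap trick in the question's code avoids.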
Upvotes: 6