Reputation: 360
How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (and hence into memory)? Memory-mapped files don't seem to help, because their content can't be turned into any kind of lazy string; the re module only accepts a string as its content argument.
#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    // Map the file into memory instead of reading it into a string.
    boost::iostreams::mapped_file fl("BigFile.log");

    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something useful");
    boost::match_flag_type flags = boost::match_default;

    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();

    // Search the mapped region iteratively, one match at a time.
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while (boost::regex_search(start, end, what, expr))
    {
        std::cout << what[0].str() << std::endl;
        start = what[0].second;
    }
    return 0;
}
To demonstrate my requirements, I wrote the short sample above in C++ (using Boost); that is what I would like to have in Python.
Upvotes: 6
Views: 3608
Reputation: 360
Everything now works fine (Python 3.2.3 has some interface differences from Python 2.7). The search pattern just has to be written as a bytes literal (prefixed with b"...") to get a working solution in Python 3.2.3.
import re
import mmap
import pprint

def ParseFile(fileName):
    # Open in binary mode; the mapped content is bytes, so the pattern must be bytes too.
    f = open(fileName, "rb")
    print("File opened successfully")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print("File mapped successfully")
    # re can search the mmap object directly, without building a huge string.
    items = re.finditer(b"\\w+>Time Elapsed .*?\n", m)
    for item in items:
        pprint.pprint(item.group(0))

if __name__ == "__main__":
    ParseFile("testre")
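For what it's worth, a slightly tidier variant of the same idea (just a sketch, reusing the example file name from above) closes the file and the mapping with context managers and decodes the bytes matches before printing:
import re
import mmap

# Sketch only: "testre" is the example file name from above.
def parse_file(file_name):
    with open(file_name, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for match in re.finditer(br"\w+>Time Elapsed .*?\n", m):
            # Matches are bytes; decode them for normal text output.
            print(match.group(0).decode("utf-8", errors="replace"))

if __name__ == "__main__":
    parse_file("testre")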
Upvotes: 8
Reputation: 1720
To elaborate on Julian's solution, you could achieve chunking (if you want to do multiline regexes) by storing and concatenating consecutive lines, like so:
list_prev_lines = []
for i in range(N):
    list_prev_lines.append(f.readline())
for line in f:
    list_prev_lines.pop(0)
    list_prev_lines.append(line)
    # Lines already end with "\n", so join them without a separator.
    parse("".join(list_prev_lines))
This keeps a running list of the previous N lines (including the current one) and parses that multi-line window as a single string.
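Put together as a self-contained sketch (the file name, the window size N, and the pattern are just placeholders here), it might look like:
import re

# Placeholders: adjust the window size, pattern, and file name to your data.
N = 3
pattern = re.compile(r"BEGIN.*?END", re.DOTALL)

with open("some.log") as f:
    list_prev_lines = [f.readline() for _ in range(N)]
    for line in f:
        list_prev_lines.pop(0)
        list_prev_lines.append(line)
        # Note: a match may be reported again while it remains inside the sliding window.
        for match in pattern.finditer("".join(list_prev_lines)):
            print(match.group(0))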
Upvotes: 1
Reputation: 3429
It depends on what sort of parsing you're doing.
If your parsing is line-by-line, you can iterate over the lines of the file with:
with open("/some/path") as f:
    for line in f:
        parse(line)
Otherwise, you'll need something like chunking: read a chunk at a time and parse it. Obviously you then have to be much more careful about matches that straddle a chunk boundary, e.g. by carrying some overlap from one chunk into the next, as sketched below.
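For example, a rough sketch of such a chunked scan (the chunk size and the assumed maximum match length are arbitrary, and find_in_chunks is just an illustrative name) could defer matches that end near the edge of the buffer until the next chunk has been read:
import re

def find_in_chunks(path, pattern, chunk_size=1 << 20, max_match=4096):
    # Sketch only: assumes no single match is longer than max_match bytes.
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        buf = b""
        while True:
            chunk = f.read(chunk_size)
            buf += chunk
            at_eof = not chunk
            # A match ending near the end of the buffer might still grow once
            # the next chunk arrives, so hold it back until EOF.
            safe_end = len(buf) if at_eof else max(0, len(buf) - max_match)
            carry_from = len(buf)
            for match in regex.finditer(buf):
                if match.end() > safe_end:
                    carry_from = match.start()
                    break
                yield match.group(0)
            if at_eof:
                return
            # Keep the unfinished tail (at least the last max_match bytes) for the next round.
            buf = buf[min(carry_from, safe_end):]

# Example use (file name and pattern are placeholders):
# for m in find_in_chunks("BigFile.log", br"\w+>Time Elapsed .*?\n"):
#     print(m)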
Upvotes: 6