Reputation: 12539
I have a few very large log files, and I need to parse them. Ease of implementation obviously points me to a Perl and regex combo (in which I am still a novice). But what about speed? Will it be faster to implement it in C? Each log file is on the order of 2 GB.
Upvotes: 20
Views: 10821
Reputation: 27173
If you are equally skilled in C and Perl, the answer is simple: use Perl.
Generally, I'd say this applies unless you are some sort of C godlet who can deftly reshape the foundations of reality through puissant manipulation of pointers and typecasts.
Seriously, the regex implementation in Perl is very fast, flexible, and well tested. Any code you write may be fast and flexible, but it can never be as thoroughly tested.
Since you are new to Perl and regex, remember that there are resources that can provide excellent help when you need it. There are even some nice tutorials in the fine manual.
Whatever you do, don't do this:
for my $line ( <$log> ) {
    # parse line here.
}
The for loop evaluates <$log> in list context, so it reads the whole log file into memory before the first iteration; that will take forever as your system swaps and swaps (and possibly crashes).
Instead, use a while loop:
while (defined( my $line = <$log> )) {
    # parse line here.
}
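Putting it together, here is a minimal sketch of a line-by-line parse; the file name and the pattern are placeholders, not anything from the original question:
#!/usr/bin/perl
use strict;
use warnings;

# 'access.log' and the IPv4-ish pattern are placeholders; substitute your own.
open my $log, '<', 'access.log' or die "Cannot open log: $!";

my $count = 0;
while ( defined( my $line = <$log> ) ) {
    chomp $line;
    # Count lines that contain something that looks like an IPv4 address.
    $count++ if $line =~ /\b\d{1,3}(?:\.\d{1,3}){3}\b/;
}
close $log;
print "Matched $count lines\n";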
Upvotes: 17
Reputation: 881093
I very much doubt C will be faster than Perl unless you hand-compile the RE.
By hand-compiling, I mean coding the finite state machine (FSM) directly rather than using the RE engine to compile it. This approach lets you optimize for your specific case, which can often be faster than relying on the more general-purpose engine.
But that's not something I'd ever suggest to anyone who hasn't already written compilers or parsers by hand, without the benefit of lex, yacc, bison or other similar tools.
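To make the idea concrete, here is a toy sketch of what coding the machine directly means, written in Perl purely for illustration (a real hand-compiled matcher would be written in C); it recognizes the same strings as /[0-9]+/ without calling the regex engine:
# Toy illustration only: scan a run of ASCII digits starting at $pos,
# walking characters by hand instead of using the regex engine.
sub match_digit_run {
    my ( $string, $pos ) = @_;
    my $start = $pos;
    my $len   = length $string;
    while ( $pos < $len ) {
        my $c = substr $string, $pos, 1;
        last if $c lt '0' || $c gt '9';
        $pos++;
    }
    return $pos > $start ? substr( $string, $start, $pos - $start ) : undef;
}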
The generalized engines, such as PCRE, are usually powerful and fast enough (for my needs anyway, and those needs have often been very demanding).
A general RE engine needs to be able to handle all sorts of cases, whether it's written in C or Perl. So when you think about which is faster, you really only have to compare what the two RE engines are written in (hint: the Perl RE engine is not written in Perl).
They're both written in C so you should find very little difference in terms of the matching speed.
You may find differences in the support code around the REs but that will be minimal, especially if it's a simple read/match/output loop.
Upvotes: 44
Reputation: 745
If you want to read 2 GB with Perl, it is better to use sysread (with a big enough block size, e.g. 256 KB or 512 KB). PerlIO uses too small a block size (4 KB), which is inefficient. See PerlMonks for more info about the PerlIO block size.
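A rough sketch of that approach, with an arbitrary 512 KB block size and a placeholder file name; lines that straddle block boundaries have to be carried over by hand:
use strict;
use warnings;

my $blocksize = 512 * 1024;                  # 512 KB per sysread; tune to taste
open my $fh, '<:raw', 'huge.log' or die "Cannot open log: $!";   # placeholder name

my $lines = 0;
my $tail  = '';                              # unfinished line carried between blocks
while ( sysread( $fh, my $block, $blocksize ) ) {
    $block = $tail . $block;
    my @complete = split /\n/, $block, -1;   # limit -1 keeps a trailing empty field
    $tail = pop @complete;                   # partial last line ('' if block ended in "\n")
    for my $line (@complete) {
        $lines++;                            # parse $line here instead of just counting
    }
}
$lines++ if length $tail;                    # file did not end with a newline
close $fh;
print "$lines lines\n";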
Upvotes: 1
Reputation: 104065
The Perl regex matcher is heavily optimized. This is where Perl shines; you should have no trouble working with a 2 GB file in Perl, and the performance should be easily comparable to a C version. By the way: did you try to look for an already finished log parser? There are plenty of them.
Upvotes: 20
Reputation: 3418
If you are going to be applying the same regular expression to every line, don't forget that you can avoid recompilation by appending the /o flag to the pattern, i.e.
if (/[a-zA-Z]+/o)
This causes the expression to be compiled internally only once, with the compiled form reused on every subsequent loop iteration rather than being rebuilt each time.
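A self-contained illustration; the file name and pattern are made up:
use strict;
use warnings;

open my $log, '<', 'app.log' or die "Cannot open log: $!";   # placeholder file name
while ( defined( my $line = <$log> ) ) {
    # /o asks Perl to compile this pattern only once for the whole loop.
    if ( $line =~ /error: ([a-zA-Z]+)/o ) {
        print "$1\n";
    }
}
close $log;
(As a side note, on recent perls a constant pattern like this one is compiled just once anyway; /o matters most when the pattern interpolates variables.)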
Armed with that enhancement, I would be very surprised if your Perl parser didn't walk all over whatever C implementation you'd feasibly be able to come up with in a realistic amount of time.
Upvotes: 1
Reputation: 1691
If you are parsing logs in the Apache common log format, Visitors, which is written in C, will beat any comparable Perl log parser by at least a factor of 2.
So find existing parsers and benchmark them if the log format is common.
Based on my past experience, a properly written log parser in C will always be significantly faster than a properly written log parser in Perl.
Upvotes: 1
Reputation: 44794
Yes, you can make a much faster parser in C if you know what you are doing.
However, for the vast majority of people, a smarter thing to worry about is ease of implementation and maintenance of the code. A fast parser that you can't get to work right does nobody any good.
Upvotes: 3
Reputation: 124257
Upvotes: 22
Reputation: 7010
Part of this depends on how the parsing will be integrated into an application. If the application IS the parser, then Perl will be fine, if only because it will also handle everything surrounding the parsing. But if the parser is integrated DIRECTLY into a larger application, then you may well want to look into something like Lex (or Flex these days): http://en.wikipedia.org/wiki/Lex_(software). This tool generates the parser for you, and you can integrate the generated C/C++ code directly into your software.
As for speed, I agree with most other responders here that the maturity of the library used will be the dominant factor, and Perl's regex library is VERY mature. I don't know how mature some of the other libraries are (like Boost's regex library for C++), but since most of your processing time will be spent inside the library, language concerns are likely secondary.
Bottom line: use what you're most comfortable with, and do as much work as possible inside the library, as it's almost always faster than what you can produce yourself, in any language.
Upvotes: 3
Reputation: 36832
If you are proficient in Perl, use it. Otherwise, use AWK and SED.
Parsing text is not what you want to do with C.
Upvotes: 2
Reputation: 47829
Is speed really a factor here? Do you actually care whether the parsing takes 5 minutes or 10?
Go for the language or tool that offers the best parsing features and that you are most familiar with.
Upvotes: 7
Reputation: 98378
I'm guessing (in lieu of benchmarks against Alphaneo's actual data, which I don't have) that I/O is going to be the limiting factor here. And I'd expect a Perl implementation on a perl with usefaststdio enabled to match or beat a basic C implementation, but to be noticeably slower without usefaststdio. (usefaststdio was on by default in perl 5.8 and earlier for most platforms, and is off by default in perl 5.10.)
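If you are not sure how your own perl was built, you can query the configuration from the command line:
perl -V:usefaststdio
which prints something like usefaststdio='define'; when the option is enabled, or usefaststdio='undef'; when it is not.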
Upvotes: 8
Reputation:
If you actually need to use regexes, then the Perl regex engine is hard to beat. However, many parsing problems can be solved more efficiently without them, for example when you just need to split a line at a certain character; in that case C will probably be faster.
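For instance, a fixed-separator split needs no hand-written regex at all; the tab-separated layout below is just an assumption for illustration:
use strict;
use warnings;

# Hypothetical tab-separated log line; the field layout is made up.
my $line = "2009-07-01 12:00:00\tGET\t/index.html\t200";
my ( $when, $method, $path, $status ) = split /\t/, $line;
print "$status $path\n";

# Or avoid split entirely and pull out the first field with index/substr:
my $first_tab = index $line, "\t";
my $timestamp = substr $line, 0, $first_tab;
print "$timestamp\n";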
If performance is of overriding importance, then you should try both languages, and measure the speed difference. Otherwise, simply use the one you are most comfortable with.
Upvotes: 13
Reputation: 46773
Perl obviously has some overhead compared to C. But this overhead may be negligible if you spend most of the time inside Perl's regex functions, which are implemented in C.
Upvotes: 4
Reputation: 300489
In the past, I have found C to be faster, but not to the extent that the choice was a foregone conclusion.
Have you thought about using a generic log-parsing tool, such as Microsoft's Log Parser:
Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®.
This site lists a few generic log parsers.
Upvotes: 4