Dan Stevens

Reputation: 6830

Extracting data from file using regex - match line at a time or entire file?

I have a program that reads each line of a file, extracting data according to a specific format defined by a regular expression. Instead of calling Match() once per line, I could call Match() against the entire contents of the file. Which is the more efficient solution?

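The latter choice would require the RegexOptions.Multiline option, so that ^ and $ match at the start and end of each line rather than only at the start and end of the whole input.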

Update:

The file is specified by the end-user so it could be large (~37000 lines, ~2MB). It is not necessary for every line to contain a valid entry.

The regular expression I'm using is ^\s*(OPTL_\w*)\s*=>\s*(\d+)\s*$. For example, this would match a line consisting of the text OPTL_Example => 123, but would not match a line consisting of the text FooBar => 999.
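
To make the intended behaviour concrete, here is a minimal sketch of applying that pattern to a single line. The names PatternDemo and TryParse are placeholders for illustration only, not part of the actual program.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // The pattern from the question, constructed once and reused.
    static readonly Regex Entry = new Regex(@"^\s*(OPTL_\w*)\s*=>\s*(\d+)\s*$");

    // Returns true and fills name/value when the line is a valid entry.
    // Hypothetical helper; a value too large for an int is treated as no match.
    public static bool TryParse(string line, out string name, out int value)
    {
        name = string.Empty;
        value = 0;
        Match m = Entry.Match(line);
        if (!m.Success) return false;
        name = m.Groups[1].Value;
        return int.TryParse(m.Groups[2].Value, out value);
    }

    static void Main()
    {
        foreach (string line in new[] { "OPTL_Example => 123", "FooBar => 999" })
        {
            string name;
            int value;
            bool ok = TryParse(line, out name, out value);
            Console.WriteLine(ok ? "match: " + name + " -> " + value : "no match: " + line);
        }
    }
}
```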

Upvotes: 1

Views: 641

Answers (4)

paparazzo

Reputation: 45096

It depends on whether you are optimizing for speed or stability.

If this is an end-user app and you don't have control over file size or available memory, then I would take the safe route and read line by line to keep memory use bounded. Be sure to build the regex outside the loop so that you are only calling .Match inside it. ReadLine is pretty fast.

You could set up some parallel processing so that it reads the next line while it is parsing the current one. But that simple regex is so fast that I'm not sure it would help. Whether you go a line at a time or read the entire file, the disk I/O to read the file is most likely the slowest operation.

If this is a server app with limited distribution and speed is critical, then read it all in.
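
A minimal sketch of that line-by-line approach, assuming the pattern from the question. The class name LineByLine, the method ReadEntries and the use of args[0] as the file path are made up for illustration; adapt them to your program.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class LineByLine
{
    // Built once, outside the loop, so the loop only pays for Match().
    static readonly Regex Entry =
        new Regex(@"^\s*(OPTL_\w*)\s*=>\s*(\d+)\s*$", RegexOptions.Compiled);

    // Streams entries one line at a time; only the current line is held in memory.
    public static IEnumerable<KeyValuePair<string, string>> ReadEntries(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = Entry.Match(line);
            if (m.Success) // lines that are not valid entries are skipped
                yield return new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value);
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 0) { Console.WriteLine("usage: LineByLine <file>"); return; }
        // args[0] is assumed to be the user-specified file path.
        using (var reader = new StreamReader(args[0]))
        {
            foreach (var entry in ReadEntries(reader))
                Console.WriteLine(entry.Key + " = " + entry.Value);
        }
    }
}
```

Taking a TextReader rather than a path keeps the parsing loop independent of where the text comes from, so the same method works on a StringReader when you want to try it without a file.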

Upvotes: 2

Ωmega

Reputation: 43673

There is no general or single correct answer, as it depends on many factors. The major one is the speed of your I/O. Why don't you just test both solutions? With a size of 2MB, I would expect working with the entire content to be faster and more efficient.
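
A rough way to test both, as a sketch only: the class and method names here are invented, the file path is assumed to come from args[0], and a single Stopwatch run like this is not a rigorous benchmark (JIT warm-up and the OS file cache both favour whichever approach runs second).

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

static class CompareApproaches
{
    const string Pattern = @"^\s*(OPTL_\w*)\s*=>\s*(\d+)\s*$";
    static readonly Regex PerLine = new Regex(Pattern);
    // Multiline makes ^ and $ match at line boundaries inside one big string.
    // Note that \s also matches newlines, so in this mode an entry split
    // across two lines would match, which the per-line version rejects.
    static readonly Regex WholeFile = new Regex(Pattern, RegexOptions.Multiline);

    // Counts matching lines, reading one line at a time.
    public static int CountPerLine(TextReader reader)
    {
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
            if (PerLine.IsMatch(line)) count++;
        return count;
    }

    // Counts matches across the whole text in a single pass.
    public static int CountWholeText(string text)
    {
        return WholeFile.Matches(text).Count;
    }

    static void Main(string[] args)
    {
        if (args.Length == 0) { Console.WriteLine("usage: CompareApproaches <file>"); return; }

        var sw = Stopwatch.StartNew();
        int perLineCount;
        using (var reader = new StreamReader(args[0]))
            perLineCount = CountPerLine(reader);
        long perLineMs = sw.ElapsedMilliseconds;

        sw.Restart();
        int wholeCount = CountWholeText(File.ReadAllText(args[0]));
        long wholeMs = sw.ElapsedMilliseconds;

        Console.WriteLine("per line:   " + perLineCount + " matches, " + perLineMs + " ms");
        Console.WriteLine("whole file: " + wholeCount + " matches, " + wholeMs + " ms");
    }
}
```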

Upvotes: 0

Ondra

Reputation: 1647

Choosing the line-by-line solution could allow you to run the regexes in parallel. The question is whether all the overhead of parallel processing is worth it. If your regex is complicated, or you do some other per-line processing that can be run in parallel, I would at least try it.
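
If you want to try it, one possible sketch uses PLINQ over the lines. The names ParallelLines and MatchInParallel are hypothetical, and whether this beats a plain loop for such a cheap regex is something you would have to measure.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static class ParallelLines
{
    // A Regex instance is safe to share across threads for matching.
    static readonly Regex Entry = new Regex(@"^\s*(OPTL_\w*)\s*=>\s*(\d+)\s*$");

    // Matches each line on a worker thread and returns "name=value" strings.
    // AsOrdered keeps the results in the original line order.
    public static List<string> MatchInParallel(IEnumerable<string> lines)
    {
        return lines
            .AsParallel()
            .AsOrdered()
            .Select(line =>
            {
                // Each Match object is created and consumed on the same thread.
                Match m = Entry.Match(line);
                return m.Success ? m.Groups[1].Value + "=" + m.Groups[2].Value : null;
            })
            .Where(result => result != null)
            .ToList();
    }

    static void Main(string[] args)
    {
        if (args.Length == 0) { Console.WriteLine("usage: ParallelLines <file>"); return; }
        // File.ReadLines enumerates lazily, so the file is not loaded all at once.
        foreach (string result in MatchInParallel(File.ReadLines(args[0])))
            Console.WriteLine(result);
    }
}
```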

Upvotes: 0

dueyfinster

Reputation: 327

It depends on your memory constraints. If you have multiple regexes that you can run on the file as a whole, it is just as efficient to keep the whole file in memory. However, if your regexes modify lines (and then repeat this process, with cascading regexes that depend on one another), I'd go for a line-by-line solution.

Upvotes: 0
