savyuk

Reputation: 291

CSV-parsing regular expression performance

I'm using a regex to parse a CSV-like file. I'm new to regular expressions, and while the regex works, it becomes slow when a line has many fields and one of the fields contains a very long value. How can I optimize it?

The CSV I have to parse is of the following flavor:

  1. All fields are strings enclosed in quotes separated by commas
  2. Quotes inside fields are escaped in the form of two consecutive quotes
  3. There is unpredictable garbage at the start of some lines, which needs to be ignored (so far it hasn't contained quotes, thankfully)
  4. Zero-length fields and newlines in fields are possible

I am working with VB.NET. I am using the following regex:

(^(?!").+?|^(?="))(?<Entry>"(",|(.*?)"(?<!((?!").("")+)),))*(?<LastEntry>"("$|(.*?)"(?<!((?!").("")+))$))

I handle newlines by appending the results of successive StreamReader.ReadLine calls to a string variable until the regex succeeds, replacing each newline with a space (this is OK for my purposes). I then extract the field contents from Match.Groups("Entry").Captures and Match.Groups("LastEntry").

I suppose the performance hit is coming from the look-behind for escaped quotes. Is there a better way?

Thanks for any ideas!

Upvotes: 2

Views: 308

Answers (1)

Tim Pietzcker

Reputation: 336098

I think your regex is needlessly complicated, and the nested quantifiers cause catastrophic backtracking. Try the following:

^[^"]*(?<Entry>(?>"(?>[^"]+|"")*"),)*(?<LastEntry>(?>"(?>[^"]+|"")*"))$

Explanation:

^                 # Start of string
[^"]*             # Optional non-quotes
(?<Entry>         # Match group 'Entry'
 (?>              # Match, and don't allow backtracking (atomic group):
  "               # a quote
  (?>             # followed by this atomic group:
   [^"]+          # one or more non-quote characters
  |               # or
   ""             # two quotes in a row
  )*              # repeat 0 or more times.
  "               # Then match a closing quote
 )                # End of atomic group
 ,                # Match a comma
)*                # End of group 'Entry'; repeat it 0 or more times
(?<LastEntry>     # Match the final group 'LastEntry'
 (?>              # same as before
  "               # quoted field...
  (?>[^"]+|"")*   # containing non-quotes or escaped (doubled) quotes
  "               # and a closing quote
 )                # exactly once.
)                 # End of group 'LastEntry'
$                 # End of string
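
As a side note, the commented layout above can be passed to the regex engine almost as-is: with RegexOptions.IgnorePatternWhitespace, .NET ignores unescaped whitespace in the pattern and treats # as the start of a comment. The following is only a minimal sketch; the module name and the sample line are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CommentedPatternDemo
    Sub Main()
        ' The same pattern as above, kept in a commented, multi-line layout.
        ' Each "#" comment runs to the end of its line, so the parts are
        ' joined with newlines.
        Dim pattern As String = String.Join(Environment.NewLine, {
            "^[^""]*                      # optional garbage without quotes",
            "(?<Entry>                    # every field except the last",
            "  (?>""(?>[^""]+|"""")*"")   # atomic quoted field",
            "  ,                          # followed by a comma",
            ")*",
            "(?<LastEntry>                # the last field",
            "  (?>""(?>[^""]+|"""")*"")",
            ")$"
        })
        Dim re As New Regex(pattern, RegexOptions.IgnorePatternWhitespace)

        ' Hypothetical sample: garbage prefix, escaped quotes, an empty field.
        ' Actual text: junk"a","b ""x"" c","","last"
        Dim line As String = "junk""a"",""b """"x"""" c"","""",""last"""

        Dim m As Match = re.Match(line)
        Console.WriteLine(m.Success)                         ' True
        Console.WriteLine(m.Groups("Entry").Captures.Count)  ' 3
        Console.WriteLine(m.Groups("LastEntry").Value)       ' "last" (quotes included)
    End Sub
End Module
```

Note that the captured values still include the enclosing quotes (and, for 'Entry', the trailing comma), so they need to be stripped afterwards.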

This should also work on the entire file at once (using RegexOptions.Multiline), so you wouldn't have to append one line after the next until the regex matches, and you wouldn't have to replace the newlines:

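' Requires: Imports System.Text.RegularExpressions
' The inner groups are atomic, (?>...), to match the pattern shown above.
Dim RegexObj As New Regex("^[^""]*(?<Entry>(?>""(?>[^""]+|"""")*""),)*(?<LastEntry>(?>""(?>[^""]+|"""")*""))$", RegexOptions.Multiline)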
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
    ' now you can access MatchResults.Groups("Entry").Captures and
    ' MatchResults.Groups("LastEntry")
    MatchResults = MatchResults.NextMatch()
End While
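
To get from the captures to the actual field values, the enclosing quotes and trailing commas still have to be removed and the doubled quotes unescaped. Here is a rough sketch of how that could look. The names Unquote and ParseRecords and the sample text are made up for illustration, and the sample assumes LF line endings: in .NET, $ with RegexOptions.Multiline matches only before \n, so CRLF input would need \r?$ or normalizing first.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module FieldExtractionDemo
    ' Hypothetical helper: drops the trailing comma (if any) and the
    ' enclosing quotes, then turns each doubled quote into a single one.
    Function Unquote(raw As String) As String
        Dim s As String = raw.TrimEnd(","c)
        Return s.Substring(1, s.Length - 2).Replace("""""", """")
    End Function

    ' Hypothetical helper: returns one list of field values per record.
    Function ParseRecords(text As String) As List(Of List(Of String))
        Dim re As New Regex(
            "^[^""]*(?<Entry>(?>""(?>[^""]+|"""")*""),)*(?<LastEntry>(?>""(?>[^""]+|"""")*""))$",
            RegexOptions.Multiline)
        Dim records As New List(Of List(Of String))
        Dim m As Match = re.Match(text)
        While m.Success
            Dim fields As New List(Of String)
            For Each c As Capture In m.Groups("Entry").Captures
                fields.Add(Unquote(c.Value))
            Next
            fields.Add(Unquote(m.Groups("LastEntry").Value))
            records.Add(fields)
            m = m.NextMatch()
        End While
        Return records
    End Function

    Sub Main()
        ' Made-up input (assumes LF line endings), two records:
        '   junk"a","say ""hi"" there"
        '   "","two<LF>lines"
        Dim text As String =
            "junk""a"",""say """"hi"""" there""" & vbLf &
            """"",""two" & vbLf & "lines"""

        Dim records = ParseRecords(text)
        Console.WriteLine(records.Count)   ' 2
        Console.WriteLine(records(0)(1))   ' say "hi" there
        Console.WriteLine(records(1)(0).Length)  ' 0 (empty field)
    End Sub
End Module
```

The second record shows that a newline inside a field is kept as part of the value, since [^"]+ matches newlines as well.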

Upvotes: 1
