Reputation: 11995
I am trying to write a regex to match all strings that appear between enclosing characters (most likely " - double quotes). This is a scenario I commonly encounter while trying to parse a line in a CSV file.
So I have a sample line like:
"Smith, John",25,"21/45, North Avenue",IBM
I tried the following regex:
"(.*)"
But it fetches something like the following:
I am expecting output as follows:
Smith, John
25
21/45, North Avenue
IBM
The regex I wrote is an attempt to capture whatever comes between the " characters in my example, but the output above is what I actually expect.
There is some ambiguity, though: I am not looking for a match like ,25, which makes me wonder whether a regex is even feasible here.
What is the correct way to write this?
Upvotes: 0
Views: 126
Reputation: 739
Firstly, that will only capture one group. Secondly, you need to be non-greedy:
(?:"(.*?)")
On its own, the pattern does not solve your problem of multiple matches in a single line; you have to apply it repeatedly with a find-all call. Here are two examples. In Python:
import re
string = '"Smith, John",25,"21/45, North Avenue",IBM'
pattern = r'(?:"(.*?)")'
# findall returns the contents of the single capture group for every match
print(re.findall(pattern, string))
# prints ['Smith, John', '21/45, North Avenue']
In C#:
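using System;
using System.Text.RegularExpressions;
// In a verbatim string, a literal double quote is written as "" (no backslash).
string pattern = @"(?:""(.*?)"")";
string input = @"""Smith, John"",25,""21/45, North Avenue"",IBM";
// Groups[1] holds the text between the quotes; m.Value would include the quotes.
foreach (Match m in Regex.Matches(input, pattern))
    Console.WriteLine("'{0}' found at index {1}.", m.Groups[1].Value, m.Index);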
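To see why the non-greedy quantifier matters, here is a minimal Python sketch that runs the original greedy pattern and the non-greedy one against the sample line from the question. The variable names are illustrative, and the negated character class in the last pattern is an alternative that is not part of the answer above.

```python
import re

line = '"Smith, John",25,"21/45, North Avenue",IBM'

# Greedy: .* runs to the last quote on the line, so both quoted
# fields (and everything between them) collapse into one match.
greedy = re.findall(r'"(.*)"', line)

# Non-greedy: .*? stops at the nearest closing quote.
lazy = re.findall(r'"(.*?)"', line)

# Alternative: forbid quotes inside the match instead of relying on laziness.
negated = re.findall(r'"([^"]*)"', line)

print(greedy)
print(lazy)
print(negated)
```

With either of the last two patterns, only the quoted fields are found; the unquoted fields 25 and IBM are still missed.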
Upvotes: 0
Reputation: 336478
If you really want to roll your own CSV parser, you'll need to teach your regex a few rules:
So, to match one CSV field, you can use the following regex:
(?mx)           # Verbose, multiline mode
(?<=^|,)        # Assert there is a comma or start of line before the current position.
(?:             # Start non-capturing group:
  "             # Either match an opening quote, followed by
  (?:           # a non-capturing group:
    ""          # Either an escaped quote
  |             # or
    [^"]+       # any characters except quotes
  )*            # End of inner non-capturing group, repeat as needed.
  "             # Match a closing quote.
|               # OR
  [^,"\r\n]+    # Match one or more characters except commas, quotes or newlines
)               # End of outer non-capturing group
(?=,|$)         # Assert there is a comma or end-of-line after the current position
See it live on regex101.com.
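As a rough illustration, the field regex above can be tried from Python. This is a sketch under a few assumptions: Python's re module rejects the lookbehind (?<=^|,) because its alternatives differ in width, so it is rewritten here as (?:^|(?<=,)); the helper name parse_fields is hypothetical; and stripping the outer quotes and un-doubling "" is a post-processing step added for the example. Like the original regex, it does not match empty fields.

```python
import re

# Adaptation of the verbose regex above for Python's re module.
FIELD = re.compile(r'''
    (?:^|(?<=,))          # Start of line, or position right after a comma
    (?:
        "(?:""|[^"]+)*"   # Quoted field, with "" as an escaped quote
      |
        [^,"\r\n]+        # Unquoted field: no commas, quotes or newlines
    )
    (?=,|$)               # Followed by a comma or end of line
''', re.VERBOSE | re.MULTILINE)


def parse_fields(line):
    """Return the fields of one CSV line (empty fields are skipped)."""
    fields = []
    for match in FIELD.finditer(line):
        text = match.group(0)
        if text.startswith('"'):
            # Drop the enclosing quotes and un-double escaped quotes.
            text = text[1:-1].replace('""', '"')
        fields.append(text)
    return fields


print(parse_fields('"Smith, John",25,"21/45, North Avenue",IBM'))
```

On the sample line from the question this yields the four expected fields: Smith, John / 25 / 21/45, North Avenue / IBM.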
Upvotes: 1
Reputation: 9460
Please don't use regex for this; CSV should be handled by a parser.
Here is a ready-to-use parser: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
You can also use the OLEDB built-in parser: http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser
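The same advice applies outside C#. As a minimal sketch, assuming Python is an option (it is not one of the libraries linked above), the standard-library csv module handles quoted fields and doubled quotes without any regex:

```python
import csv

line = '"Smith, John",25,"21/45, North Avenue",IBM'

# csv.reader accepts any iterable of lines; here it is a one-element list.
fields = next(csv.reader([line]))
print(fields)

# Doubled quotes inside a quoted field are unescaped by the parser.
escaped = next(csv.reader(['"say ""hi""",x']))
print(escaped)
```

For a real file, the csv documentation recommends opening it with newline='' and passing the file object to csv.reader directly.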
Hope this helps
Upvotes: 1