deostroll

Reputation: 11995

Regex to extract everything in quotes

I am trying to write a regex to match all strings which appear in between enclosing characters (most likely " - double quotes). This is a scenario I commonly encounter while trying to parse a line in a csv file.

So I have a sample line like:

"Smith, John",25,"21/45, North Avenue",IBM

Tried the following regex:

"(.*)"

But it matches as follows:

http://regexr.com?37ie3

I am expecting output as follows:

Smith, John
25
21/45, North Avenue
IBM

The regex I have written is an attempt to capture what comes between " in my example. However, above is the output I am expecting.

There is a kind of ambiguity though: I am not looking for a match like: ,25,. This kinda makes me wonder if a regex is even feasible here.

What is the correct way to write this?

Upvotes: 0

Views: 126

Answers (3)

qstebom

Reputation: 739

Firstly, that will only capture one group. Secondly, you need to be non-greedy:

(?:"(.*?)")

With a find-all call this returns every quoted field on the line, but not the unquoted ones (25 and IBM). Here are two examples:

import re
string = '"Smith, John",25,"21/45, North Avenue",IBM'
pattern = r'(?:"(.*?)")'
print(re.findall(pattern, string))
# ['Smith, John', '21/45, North Avenue']

In C#:

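// Requires: using System; using System.Text.RegularExpressions;
// In a verbatim string (@"..."), a literal quote is written as "" with no backslash.
string pattern = @"(?:""(.*?)"")";
string input = @"""Smith, John"",25,""21/45, North Avenue"",IBM";
// Groups[1] holds the text between the quotes; m.Value would include the quotes.
foreach (Match m in Regex.Matches(input, pattern))
    Console.WriteLine("'{0}' found at index {1}.", m.Groups[1].Value, m.Groups[1].Index);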
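
To see why the non-greedy quantifier matters, here is a small Python comparison on the sample line from the question. It is only a sketch; the variable names are made up for illustration.

```python
import re

line = '"Smith, John",25,"21/45, North Avenue",IBM'

# Greedy: .* runs to the LAST quote on the line, so there is a single match.
greedy = re.findall(r'"(.*)"', line)

# Non-greedy: .*? stops at the NEXT quote, so each quoted field matches separately.
lazy = re.findall(r'"(.*?)"', line)

print(greedy)  # ['Smith, John",25,"21/45, North Avenue']
print(lazy)    # ['Smith, John', '21/45, North Avenue']
```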

Upvotes: 0

Tim Pietzcker

Reputation: 336478

If you really want to roll your own CSV parser, you'll need to teach your regex a few rules:

  1. A field may be unquoted as long as it doesn't contain quotes, commas or newlines.
  2. A quoted field may contain any characters; quotes are escaped by doubling.
  3. Commas are used as separators.

So, to match one CSV field, you can use the following regex:

(?mx)       # Verbose, multiline mode
(?<=^|,)    # Assert there is a comma or start of line before the current position.
(?:         # Start non-capturing group:
 "          # Either match an opening quote, followed by
 (?:        # a non-capturing group:
  ""        #  Either an escaped quote
 |          #  or
  [^"]+     #  any characters except quotes
 )*         # End of inner non-capturing group, repeat as needed.
 "          # Match a closing quote.
|           # OR
 [^,"\r\n]+ # Match any number of characters except commas, quotes or newlines
)           # End of outer non-capturing group
(?=,|$)     # Assert there is a comma or end-of-line after the current position

See it live on regex101.com.
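
If you want to try this from Python, note that Python's re module rejects the variable-width lookbehind (?<=^|,), so the sketch below is an adaptation rather than the exact regex above: the lookbehind is rewritten as (?:^|(?<=,)), the inner [^"]+ is reduced to [^"] to avoid nested repetition, and two capture groups are added so the surrounding quotes can be dropped. The names FIELD and fields are hypothetical.

```python
import re

# Adapted pattern (an assumption, not the original regex verbatim).
FIELD = re.compile(r'''
    (?:^|(?<=,))           # start of line, or just after a comma
    (?:
        "((?:""|[^"])*)"   # quoted field; a doubled quote is an escaped quote
      |
        ([^,"\r\n]+)       # unquoted field: no commas, quotes or newlines
    )
    (?=,|$)                # followed by a comma or end of line
''', re.VERBOSE | re.MULTILINE)

def fields(line):
    """Return the field values of one CSV line, with quotes removed."""
    out = []
    for m in FIELD.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            out.append(quoted.replace('""', '"'))  # unescape doubled quotes
        else:
            out.append(bare)
    return out

print(fields('"Smith, John",25,"21/45, North Avenue",IBM'))
# ['Smith, John', '25', '21/45, North Avenue', 'IBM']
```

Like the regex it is based on, this skips empty unquoted fields (as in a,,b), because the unquoted branch needs at least one character.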

Upvotes: 1

Amarnath Balasubramanian

Reputation: 9460

Please don't use regex for this; CSV should be handled by a parser.

Here is a ready-to-use parser: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader

You can also use the OLEDB built-in parser: http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser
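
For comparison, since another answer here uses Python: the same idea there needs no third-party code, because the standard library ships a csv module. A minimal sketch on the sample line from the question (the variable names are just for illustration):

```python
import csv

line = '"Smith, John",25,"21/45, North Avenue",IBM'

# csv.reader accepts any iterable of lines; here a one-element list.
fields_from_parser = next(csv.reader([line]))

print(fields_from_parser)
# ['Smith, John', '25', '21/45, North Avenue', 'IBM']
```

The parser handles quoted commas and doubled quotes by itself, which is the part that is easy to get wrong with a hand-written regex.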

Hope this helps

Upvotes: 1
