Reputation: 755
I'm trying to split a CSV file into an array, but encountering two problems: quoted commas and empty elements.
I've tried numerous possibilities but haven't found a regex that works 100%.
The CSV looks like:
123,2.99,AMO024,Title,"Description, more info",,123987564
The regex I tried:
thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
In my output array the fifth element comes out as 123987564 not an empty string.
Upvotes: 66
Views: 164360
Reputation: 1622
I was looking for something for C#, and the ones posted here were a bit lacking or too long; some didn't even work correctly with Match, etc.
So I wrote my version - simple, works for me.
"[^"]+"|[^",]+|,,
C#
string pattern = @"""[^""]+""|[^"",]+|,,";
Note: there are cases where this does not work.
Upvotes: 0
Reputation: 15000
Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.
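This expression matches each field, including empty ones, and captures the value without its delimiting comma or surrounding quotes: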
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)
Sample Text
123,2.99,AMO024,Title,"Description, more info",,123987564
ASP example using the non-java expression
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
For z = 0 to Matches.Count-1
results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
For zz = 0 to Matches(z).SubMatches.Count-1
results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
next
results=Left(results,Len(results)-1) & chr(13)
next
Response.Write "<pre>" & results
Matches using the non-java expression
Group 0 gets the entire substring which includes the comma
Group 1 gets the quote if it's used
Group 2 gets the value not including the comma
[0][0] = 123
[0][1] =
[0][2] = 123
[1][0] = ,2.99
[1][1] =
[1][2] = 2.99
[2][0] = ,AMO024
[2][1] =
[2][2] = AMO024
[3][0] = ,Title
[3][1] =
[3][2] = Title
[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info
[5][0] = ,
[5][1] =
[5][2] =
[6][0] = ,123987564
[6][1] =
[6][2] = 123987564
As Boris pointed out, CSV format escapes a double quote " as a doubled double quote "". Although this requirement wasn't included by the OP, if your text includes doubled double quotes then you'll want to use this modified expression:
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)(?:[^"]|"")*|[^,"]*))"?(?=,|$)
See also: https://regex101.com/r/y8Ayag/1
It should also be pointed out that Regex is a pattern matching tool not a parsing engine. Therefore if your text includes double double quotes it will still contain the double double quotes after pattern matching is completed. With this solution you'd still need to search for the double double quotes and replace them in your captured text.
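A minimal sketch of that post-processing step in Python, assuming Python's re module (which accepts the (?(1)...|...) conditional used above). The helper name parse_fields and the sample line are illustrative and not part of the original answer.

```python
import re

# The modified expression from above; group 2 holds the raw field value.
FIELD = re.compile(r'(?:^|,)(?=[^"]|(")?)"?((?(1)(?:[^"]|"")*|[^,"]*))"?(?=,|$)')

def parse_fields(line):
    """Return the fields of one CSV line, collapsing "" to " afterwards."""
    # The regex only captures text; unescaping doubled quotes is a separate step.
    return [m.group(2).replace('""', '"') for m in FIELD.finditer(line)]

fields = parse_fields('1,"say ""hi"", ok",,x')
print(fields)
```

The replace call runs on every captured value, which is harmless for unquoted fields because the pattern does not let them contain quotes.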
Upvotes: 79
Reputation: 924
I use this expression. It takes into account a space after a comma, which I have come across.
(?:,"|^"|, ")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)
Upvotes: 1
Reputation: 345
I'm late to the party, but the following is the Regular Expression I use:
(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)
This pattern has three capturing groups: the contents of a quoted cell, the contents of an unquoted cell, and a line break.
This pattern handles all of the following:
If you are using a more capable flavor of regex with named groups and lookbehinds, I prefer the following:
(?<quoted>(?<=,"|^")(?:""|[\w\W]*?)*(?=",|"$))|(?<normal>(?<=,(?!")|^(?!"))[^,]*?(?=(?<!")$|(?<!"),))|(?<eol>\r\n|\n)
Edit
(?:^"|,")(""|[\w\W]*?)(?=",|"$)|(?:^(?!")|,(?!"))([^,]*?)(?=$|,)|(\r\n|\n)
This slightly modified pattern handles lines where the first column is empty, as long as you are not using JavaScript. For some reason JavaScript will omit the second column with this pattern. I was unable to handle this edge case correctly.
Upvotes: 16
Reputation: 2104
Aaaand another answer here. :) Since I couldn't make the others quite work.
My solution both handles escaped quotes (double occurrences), and it does not include delimiters in the match.
Note that I have been matching against ' instead of " as that was my scenario, but simply replace them in the pattern for the same effect.
Here goes (remember to use the "ignore whitespace" flag /x if you use the commented version below):
# Only include if previous char was start of string or delimiter
(?<=^|,)
(?:
# 1st option: empty quoted string (,'',)
'{2}
|
# 2nd option: nothing (,,)
(?:)
|
# 3rd option: all but quoted strings (,123,)
# (included linebreaks to allow multiline matching)
[^,'\r\n]+
|
# 4th option: quoted strings (,'123''321',)
# start pling
'
(?:
# double quote
'{2}
|
# or anything but quotes
[^']+
# at least one occurrence - greedy
)+
# end pling
'
)
# Only include if next char is delimiter or end of string
(?=,|$)
Single line version:
(?<=^|,)(?:'{2}|(?:)|[^,'\r\n]+|'(?:'{2}|[^']+)+')(?=,|$)
Upvotes: 3
Reputation: 38751
I needed this answer too, but I found the answers, while informative, a little hard to follow and replicate for other languages. Here is the simplest expression I came up with for a single column out of the CSV line. Rather than splitting the line, I'm building a regex that matches one column at a time:
("([^"]*)"|[^,]*)(,|$)
This matches a single column from the CSV line. The first portion, "([^"]*)", matches a quoted entry; the second part, [^,]*, matches a non-quoted entry. Either one is then followed by a , or the end of line $.
And the accompanying debuggex to test out the expression.
https://www.debuggex.com/r/s4z_Qi2gZiyzpAhx
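Here is a rough Python sketch of applying this expression repeatedly to walk a whole line. The function name split_columns is hypothetical, and the sketch assumes the line has no trailing newline and no escaped ("") quotes. Because the pattern can also match an empty string at the very end of the line, the loop stops as soon as the delimiter group matched end-of-line instead of a comma.

```python
import re

# The single-column expression from above.
COLUMN = re.compile(r'("([^"]*)"|[^,]*)(,|$)')

def split_columns(line):
    columns = []
    for m in COLUMN.finditer(line):
        # Group 2 is set only for quoted entries; otherwise fall back to group 1.
        columns.append(m.group(2) if m.group(2) is not None else m.group(1))
        if m.group(3) == "":
            # The delimiter group matched end of line: this was the last column.
            break
    return columns

print(split_columns('123,2.99,AMO024,Title,"Description, more info",,123987564'))
```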
Upvotes: 10
Reputation: 1435
Worked on this for a bit and came up with this solution:
(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))
This solution handles "nice" CSV data like
"a","b",c,"d",e,f,,"g"
0: "a"
1: "b"
2: c
3: "d"
4: e
5: f
6:
7: "g"
and uglier things like
"""test"" one",test' two,"""test"" 'three'","""test 'four'"""
0: """test"" one"
1: test' two
2: """test"" 'three'"
3: """test 'four'"""
Here's an explanation of how it works:
(?:,|\n|^) # all values must start at the beginning of the file,
# the end of the previous line, or at a comma
( # single capture group for ease of use; CSV can be either...
" # ...(A) a double quoted string, beginning with a double quote (")
(?: # character, containing any number (0+) of
(?:"")* # escaped double quotes (""), or
[^"]* # non-double quote characters
)* # in any order and any number of times
" # and ending with a double quote character
| # ...or (B) a non-quoted value
[^",\n]* # containing any number of characters which are not
# double quotes ("), commas (,), or newlines (\n)
| # ...or (C) a single newline or end-of-file character,
# used to capture empty values at the end of
(?:\n|$) # the file or at the ends of lines
)
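To try the pattern outside a regex tester, here is a small Python sketch run against the two sample lines above. The name read_values is made up for illustration, and stripping the outer quotes and collapsing "" is an extra step that the pattern itself does not perform.

```python
import re

# The pattern from this answer, unchanged.
VALUE = re.compile(r'(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))')

def read_values(line):
    values = []
    for m in VALUE.finditer(line):
        raw = m.group(1)
        if raw.startswith('"'):
            # Quoted value: drop the outer quotes, then collapse "" to ".
            raw = raw[1:-1].replace('""', '"')
        values.append(raw)
    return values

print(read_values('"a","b",c,"d",e,f,,"g"'))
print(read_values('"""test"" one",test\' two,"""test"" \'three\'","""test \'four\'"""'))
```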
Upvotes: 47
Reputation: 9
I'm using this one; it works with a comma separator and double-quote escaping. Normally that should solve your problem:
/(?<=^|,)(\"(?:[^"]+|"")*\"|[^,]*)(?:$|,)/g
Upvotes: 0
Reputation: 11754
I personally tried many regular expressions without finding one that matches all cases.
I think regular expressions are hard to configure properly to match every case. Although some people will not like the namespace (and I was one of them), I propose something that is part of the .NET Framework and has given me proper results every time, in all cases (mainly, it handles every double-quote case very well):
Microsoft.VisualBasic.FileIO.TextFieldParser
Found it here: StackOverflow
Example of usage:
TextReader textReader = new StringReader(simBaseCaseScenario.GetSimStudy().Study.FilesToDeleteWhenComplete);
Microsoft.VisualBasic.FileIO.TextFieldParser textFieldParser = new TextFieldParser(textReader);
textFieldParser.SetDelimiters(new string[] { ";" });
string[] fields = textFieldParser.ReadFields();
foreach (string path in fields)
{
...
Hope it could help.
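For readers outside .NET, the same "use a real parser" idea can be sketched with Python's standard csv module. This is a substitute shown for comparison, not the TextFieldParser API, and the sample line is the one from the question.

```python
import csv

line = '123,2.99,AMO024,Title,"Description, more info",,123987564'

# csv.reader accepts any iterable of lines, so wrap the single line in a list.
fields = next(csv.reader([line]))
print(fields)
```

Passing delimiter=';' to csv.reader plays the same role as SetDelimiters in the C# example.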
Upvotes: 5
Reputation: 11
This one matches everything I need in C#:
(?<=(^|,)(?<quote>"?))([^"]|(""))*?(?=\k<quote>(?=,|$))
Upvotes: 0
Reputation: 316
,?\s*'.+?'|,?\s*".+?"|[^"']+?(?=,)|[^"']+
This regex works with single and double quotes and also for one quote inside another!
Upvotes: 0
Reputation: 1273
Yet another answer with a few extra features like support for quoted values that contain escaped quotes and CR/LF characters (single values that span multiple lines).
NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is will require that your regex engine treats multiple named capture groups using the same name as one single capture group. (.NET does this by default)
When multiple lines/records of a CSV file/stream (matching RFC standard 4180) are passed to the regular expression below, it will return a match for each non-empty line/record. Each match will contain a capture group named Value that contains the captured values in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).
Here's the commented pattern (test it on Regexstorm.net):
(?<=\r|\n|^)(?!\r|\n|$) // Records start at the beginning of line (line must not be empty)
(?: // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
(?: // Group for matching one of the value formats before a comma or EOL
"(?<Value>(?:[^"]|"")*)"| // Quoted value -or-
(?<Value>(?!")[^,\r\n]+)| // Unquoted value -or-
"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)| // Open ended quoted value -or-
(?<Value>) // Empty value before comma (before EOL is excluded by "+?" quantifier later)
)
(?:,|(?=\r|\n|$)) // The value format matched must be followed by a comma or EOL
)+? // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))? // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$) // Records end at EOL
(?<=\r|\n|^)(?!\r|\n|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,\r\n]+)|"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)
Examples on how to use the regex pattern can be found on my answer to a similar question here, or on C# pad here, or here.
Upvotes: 1
Reputation: 631
If you know that you won't have an empty field (,,) then this expression works well:
("[^"]*"|[^,]+)
As in the following example...
Set rx = new RegExp
rx.Pattern = "(""[^""]*""|[^,]+)"
rx.Global = True
Set col = rx.Execute(sText)
For n = 0 to col.Count - 1
if n > 0 Then s = s & vbCrLf
s = s & col(n)
Next
However, if you anticipate an empty field and your text is relatively small, then you might consider replacing the empty fields with a space prior to parsing to ensure that they are captured. For example...
...
Set col = rx.Execute(Replace(sText, ",,", ", ,"))
...
And if you need to maintain the integrity of the fields, you can restore the commas and test for empty spaces inside the loop. This may not be the most efficient method but it gets the job done.
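A short Python sketch of the placeholder idea, for illustration only. The function name fields_with_placeholder is hypothetical, and treating a lone space as "empty" is an assumption of this sketch.

```python
import re

# The expression from above: a quoted run or a run of non-comma characters.
FIELD = re.compile(r'("[^"]*"|[^,]+)')

def fields_with_placeholder(text):
    # Put a single space into each empty field so [^,]+ has something to match.
    padded = text.replace(",,", ", ,")
    # Restore the empties: a lone space stands for an empty field.
    return ["" if f == " " else f for f in FIELD.findall(padded)]

print(fields_with_placeholder('123,2.99,AMO024,Title,"Description, more info",,123987564'))
```

Note that a single non-overlapping replace pass only pads every other gap in a run of consecutive empty fields: "a,,,b" becomes "a, ,,b", so one empty field is still lost. It also pads ",," inside quoted text.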
Upvotes: 2
Reputation: 538
I created this a few months ago for a project.
".+?"|[^"]+?(?=,)|(?<=,)[^"]+
It works in C#, and Debuggex was happy when I selected Python and PCRE. JavaScript doesn't recognize this form of "preceded by" (lookbehind), ?<=....
For your values, it will create matches on
123
,2.99
,AMO024
,Title
"Description, more info"
,
,123987564
Note that anything in quotes doesn't have a leading comma, but attempting to match with a leading comma was required for the empty value use case. Once done, trim values as necessary.
I use RegexHero.Net to test my Regex.
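The "trim values as necessary" step might look like the following in Python, whose re module accepts this fixed-width lookbehind. The name trimmed_values is illustrative, and removing the surrounding quotes is one possible reading of "trim".

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r'".+?"|[^"]+?(?=,)|(?<=,)[^"]+')

def trimmed_values(line):
    values = []
    for raw in PATTERN.findall(line):
        if raw.startswith(","):
            raw = raw[1:]      # drop the leading delimiter
        if len(raw) >= 2 and raw[0] == raw[-1] == '"':
            raw = raw[1:-1]    # drop the surrounding quotes
        values.append(raw)
    return values

print(trimmed_values('123,2.99,AMO024,Title,"Description, more info",,123987564'))
```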
Upvotes: 13
Reputation: 1
The correct regular expression to match a single quoted value with escaped [doubled] single quotes in it is:
'([^n']|(''))+'
Upvotes: -12
Reputation: 131
If I try the regex posted by @chubbsondubs on http://regex101.com using the 'g' flag, there are matches that contain only ',' or an empty string.
With this regex:
(?:"([^"]*)"|([^,]*))(?:[,])
I can match the parts of the CSV (including quoted parts). (The line must be terminated with a ',' otherwise the last part isn't recognized.)
https://regex101.com/r/dF9kQ8/4
If the CSV looks like:
"",huhu,"hel lo",world,
there are 4 matches:
''
'huhu'
'hel lo'
'world'
Upvotes: -1
Reputation: 28566
In Java this pattern ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))" almost works for me:
String text = "\",\",\",,\",,\",asdasd a,sd s,ds ds,dasda,sds,ds,\"";
String regex = ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))";
Pattern p = Pattern.compile(regex);
String[] split = p.split(text);
for(String s:split) {
System.out.println(s);
}
output:
","
",a,,"
",asdasd a,sd s,ds ds,dasda,sds,ds,"
Disadvantage: it does not work when a column has an odd number of quotes :(
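The odd-quote limitation is easy to reproduce. The sketch below uses Python's re.split with the non-capturing form of the lookahead (as in the question) so that no group text ends up in the result; the two sample strings are made up for illustration.

```python
import re

# Split on a comma only when an even number of quotes follows it.
SPLITTER = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

# Balanced quotes: the quoted comma is kept and the empty field survives.
balanced = SPLITTER.split('123,"Description, more info",,456')

# One stray quote: the first comma is no longer treated as a delimiter.
unbalanced = SPLITTER.split('a,"b,c')

print(balanced)
print(unbalanced)
```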
Upvotes: 4
Reputation: 6137
I had a similar need for splitting CSV values from SQL insert statements.
In my case, I could assume that strings were wrapped in single quotations and numbers were not.
csv.split(/,((?=')|(?=\d))/g).filter(function(x) { return x !== '';});
For some probably obvious reason, this regex produces some blank results. I could ignore those, since any empty values were represented as ...,'',... rather than ...,,... in my data.
Upvotes: 0
Reputation: 338228
The advantage of using JScript for classic ASP pages is that you can use one of the many, many libraries that have been written for JavaScript.
Like this one: https://github.com/gkindel/CSV-JS. Download it, include it in your ASP page, parse CSV with it.
<%@ language="javascript" %>
<script language="javascript" runat="server" src="scripts/csv.js"></script>
<script language="javascript" runat="server">
var text = '123,2.99,AMO024,Title,"Description, more info",,123987564',
rows = CSV.parse(text);
Response.Write(rows[0][4]);
</script>
Upvotes: 3