I need to parse a CSV line using a regex, and one particular column must be either a string in quotes or NULL (the literal string, not a null value).
I can get the column if it is in quotes using \,("[^"]*"), but any attempt to fetch it when it's NULL instead is proving beyond me. I assumed it would be something like \,(("[^"]*")|(NULL)), but that's causing weird results.
To sum the problem up, it needs to match:
"Foo"
NULL
but not:
bar
Edit:
If I use the value "This is a string, include it", the match is rejected (it is accepted with just ("[^"]*")). NULL is accepted, but the match doesn't return the string 'NULL', which isn't the behaviour I need.
Here's the full regex as it appears in the code:
@fields = $line =~ /^
(\d{0,10}+)
\,(\d{0,10}+)
\,([0-9\.]{0,6}+)
\,([0-9\.]{0,6}+)
\,([^,]*)
\,([^,]*)
\,(\d*\.?\d*)
\,(\d*\.?\d*)
\,([^,]*)
\,([^,]*)
\,([^,]*)
\,([^,]*)
\,([^,]*)
\,(\w{3}+)
\,(\w{3}+)
\,([^,]*)
\,([^,]*)
\,(\w{0,10})
\,(\d+)
\,([^,]*)
\,(\d{1}+)
\,(("[^"]*")|(NULL))
\,([^,]*)
\,([^,]*)
$
/xo;
Here's a sample line (sorry if it's nonsensical):
1111,111111,0,0,This is some text,1111.11,0.00,0.00,2014-03-14 11:11:1111.111,Text,2014-03-11 11:11:11.111,Text,Text,LLL,AAA,1900-01-01 00:00:00.000,1900-01-01 23:59:59.000,NULL,0,2014-03-11 11:00:11.111,1,NULL,1111111,NULL
Output:
1111
111111
0
0
This is some text
1111.11
0.00
0.00
2014-03-14 11:11:1111.111
Text
2014-03-11 11:11:11.111
Text
Text
LLL
AAA
1900-01-01 00:00:00.000
1900-01-01 23:59:59.000
NULL
0
2014-03-11 11:00:11.111
1
NULL
NULL
1111111
NULL
It looks like it's returning 3 values for the \,(("[^"]*")|(NULL)) match (NULL, an empty string and NULL) when it should return just a single NULL.
If I enclose the important NULL (third from last value) in quotes I get the following output:
1111
111111
0
0
This is some text
1111.11
0.00
0.00
2014-03-14 11:11:1111.111
Text
2014-03-11 11:11:11.111
Text
Text
LLL
AAA
1900-01-01 00:00:00.000
1900-01-01 23:59:59.000
NULL
0
2014-03-11 11:00:11.111
1
"NULL"
"NULL"
1111111
NULL
So that also outputs 3 values instead of the single "NULL" it should output.
Change this part of your regex:
(("[^"]*")|(NULL))
to:
("[^"]*"|NULL)
You had 3 capture groups there. The first (outer) group contained ("[^"]*")|(NULL), the second contained "[^"]*" and the third contained NULL. So when the field is NULL, you get NULL in the first capture group, an empty second capture group (it took no part in the match) and NULL in the third capture group.
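To see the three groups in isolation, here is a minimal sketch. The helper name captures_for and the cut-down pattern (just the one column between two commas) are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# Cut-down version of the problem column: an outer group plus one group per alternative.
my $nested = qr/,(("[^"]*")|(NULL)),/;

# Hypothetical helper: return every capture group from a list-context match.
sub captures_for {
    my ($line) = @_;
    my @captures = $line =~ $nested;
    return @captures;
}

for my $line (',NULL,', ',"Foo",') {
    # Groups that took no part in the match come back as undef.
    my @shown = map { defined $_ ? $_ : '<undef>' } captures_for($line);
    print join(' | ', @shown), "\n";
}
```

This should print NULL | <undef> | NULL for the first line and "Foo" | "Foo" | <undef> for the second, which lines up with the extra values in the output from the question.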
With my suggestion, you should have only one capture group, containing either "[^"]*" or NULL.
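As a rough check of the single-group version, the sketch below runs it against a few sample fields. The helper column_captures and the two-comma test strings are assumptions for illustration only. The second pattern shows the non-capturing form (?:...), in case the alternatives ever need grouping of their own without adding capture groups.

```perl
use strict;
use warnings;

# One capture group around the whole alternation.
my $single = qr/,("[^"]*"|NULL),/;

# Equivalent captures if each alternative needs its own grouping:
# (?:...) groups without capturing.
my $non_capturing = qr/,((?:"[^"]*")|(?:NULL)),/;

# Hypothetical helper: list-context match, so an empty list means "no match".
sub column_captures {
    my ($pattern, $line) = @_;
    my @captures = $line =~ $pattern;
    return @captures;
}

for my $line (',NULL,', ',"Foo",', ',"This is a string, include it",', ',bar,') {
    my @captures = column_captures($single, $line);
    print @captures ? "matched: @captures\n" : "no match for $line\n";
}
```

Each matching line yields exactly one value, and the unquoted bar is rejected.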