user1031348

Reputation: 47

Regex to match either any quoted string or a specific unquoted string

I need to parse a CSV file with a regex, and one particular column must be either a quoted string or the literal text NULL (the string, not a null value).

I can capture the column when it is in quotes using \,("[^"]*") but I can't work out how to capture it when it is NULL instead. I assumed it would be something like \,(("[^"]*")|(NULL)) but that produces unexpected results.

To sum the problem up, it needs to match:

"Foo"

NULL

but not:

bar

edit

If I use the value "This is a string, include it", the match is rejected (it is accepted with just ("[^"]*")). NULL is accepted, but the match doesn't return just the string 'NULL', which is the behaviour I need.

Here's the full regex as it appears in the code:

@fields = $line =~ /^
        (\d{0,10}+)
        \,(\d{0,10}+)
        \,([0-9\.]{0,6}+)
        \,([0-9\.]{0,6}+)
        \,([^,]*)
        \,([^,]*)       
        \,(\d*\.?\d*)
        \,(\d*\.?\d*)   
        \,([^,]*)
        \,([^,]*)
        \,([^,]*)
        \,([^,]*)
        \,([^,]*)
        \,(\w{3}+)
        \,(\w{3}+)
        \,([^,]*)
        \,([^,]*)
        \,(\w{0,10})
        \,(\d+)
        \,([^,]*)           
        \,(\d{1}+)
        \,(("[^"]*")|(NULL))
        \,([^,]*)   
        \,([^,]*)   
        $
    /xo;

Here's a sample line (sorry if it's nonsensical):

1111,111111,0,0,This is some text,1111.11,0.00,0.00,2014-03-14 11:11:1111.111,Text,2014-03-11 11:11:11.111,Text,Text,LLL,AAA,1900-01-01 00:00:00.000,1900-01-01 23:59:59.000,NULL,0,2014-03-11 11:00:11.111,1,NULL,1111111,NULL

Output:

1111
111111
0
0
This is some text
1111.11
0.00
0.00
2014-03-14 11:11:1111.111
Text
2014-03-11 11:11:11.111
Text
Text
LLL
AAA
1900-01-01 00:00:00.000
1900-01-01 23:59:59.000
NULL
0
2014-03-11 11:00:11.111
1
NULL

NULL
1111111
NULL

It looks like it's returning three values for the \,(("[^"]*")|(NULL)) match (NULL, an empty string and NULL) when it should return just a single NULL.

If I enclose the important NULL (third from last value) in quotes I get the following output:

1111
111111
0
0
This is some text
1111.11
0.00
0.00
2014-03-14 11:11:1111.111
Text
2014-03-11 11:11:11.111
Text
Text
LLL
AAA
1900-01-01 00:00:00.000
1900-01-01 23:59:59.000
NULL
0
2014-03-11 11:00:11.111
1
"NULL"
"NULL"

1111111
NULL

So that also outputs three values instead of the single "NULL" it should output.

Upvotes: 2

Views: 267

Answers (1)

Jerry

Reputation: 71598

Change this part of your regex:

(("[^"]*")|(NULL))

to:

("[^"]*"|NULL)

Your pattern had three capture groups there. The first (outer) group contained ("[^"]*")|(NULL), the second contained "[^"]*" and the third contained NULL. When the field was NULL, you got NULL in the first capture group, an empty (undefined) second capture group and NULL in the third capture group.

With my suggestion, you should have only one capture group, matching either "[^"]*" or NULL.
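To see the difference in isolation, here is a minimal sketch that runs both alternations against a single field rather than your full 24-column line. The sub names match_nested and match_single and the shortened inputs are made up for illustration; the anchors ^ and $ stand in for the neighbouring columns.

```perl
use strict;
use warnings;

# Original alternation: an outer group plus one group per alternative (3 total).
sub match_nested {
    my ($text) = @_;
    my @groups = $text =~ /^,(("[^"]*")|(NULL))$/;
    return @groups;
}

# Suggested alternation: one group around both alternatives (1 total).
sub match_single {
    my ($text) = @_;
    my @groups = $text =~ /^,("[^"]*"|NULL)$/;
    return @groups;
}

# Hypothetical inputs: just the one field, with its leading comma.
for my $text (',"Foo"', ',NULL', ',bar') {
    for my $try ([nested => \&match_nested], [single => \&match_single]) {
        my ($name, $matcher) = @$try;
        # Groups that did not take part in the match come back as undef.
        my @shown = map { defined $_ ? $_ : '(undef)' } $matcher->($text);
        printf "%-7s %-6s => %d value(s): %s\n",
            $text, $name, scalar @shown, join(' | ', @shown);
    }
}
```

In list context a successful match returns one value per capture group, so the nested version should put three entries into @fields for that column while the single-group version puts in one. A failed match (the bar case) returns an empty list with either pattern.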

Upvotes: 2
