Sharun
Sharun

Reputation: 2018

Regex to detect invalid rows in csv file with numbers containing commas

Below are sample rows in my file. Each row has 3 numeric fields. Some numbers have commas and are quoted. And some rows have empty excess commas (file is output from extracting tabular data from pdfs) I am trying to detect and remove excess empty fields to the left or right of a valid 3 field match.

"3,33","4,03","12,23" - valid  
2,,"3,23"             - valid  
,,"3,13",1,6 - invalid - remove first 2 commas   
,1,,"3,987", - invalid - remove comma at beginning and end  

This is what I have - (([0-9,]*,){3}) But struggling to work out how to match the quotes which may or may not surround the number.

Basically I am not clear about how if a quote is detected to then look for an ending quote and if not, to not look for any ending quote. Thanks for any help!

(I am using sublimetext 3 to do the search and replace but I could do it in Python if required.)

Upvotes: 2

Views: 785

Answers (2)

Thomas Nelson
Thomas Nelson

Reputation: 683

If you are using python, I'd suggest the csv library:

data = '''"3,33","4,03","12,23"
2,,"3,23"
,,"3,13",1,6
,1,,"3,987",'''.split('\n')
import csv
for row in csv.reader(data):
    print(row)

Output:

['3,33', '4,03', '12,23']
['2', '', '3,23']
['', '', '3,13', '1', '6']
['', '1', '', '3,987', '']

But if you want to do a regex, you can:

(([0-9]*)|("[0-9,]*"),){3}

Using the vertical pipe | for "or" so either a sequence of digits [0-9]* or digits and commas surrounded by quotes "[0-9,]*".

Upvotes: 2

Jan Schejbal
Jan Schejbal

Reputation: 4033

It seems like you field content may be either:

  • empty
  • an integer \d+
  • a quoted decimal number "\d+,\d+"

This can be expressed as (|\d+|"\d+,\d+"). You want three of these, comma-separated:

(|\d+|"\d+,\d+"),(|\d+|"\d+,\d+"),(|\d+|"\d+,\d+")

(you could mess around with repetition using (|\d+|"\d+,\d+")(,(|\d+|"\d+,\d+")){2}, but just having the same thing three times is easier to understand).

Note that this would just assume the last field as "empty" as that is good enough to match your definition. So we need to anchor it to the beginning and end of the line:

^(|\d+|"\d+,\d+"),(|\d+|"\d+,\d+"),(|\d+|"\d+,\d+")$

This will match the valid lines only.

According to your explanation, there may be extra commas at the beginning and end, but no other characters, so that would be:

^,*(|\d+|"\d+,\d+"),(|\d+|"\d+,\d+"),(|\d+|"\d+,\d+"),*$

This now matches all lines, but doesn't give you what you actually care about, the "good" part in the middle. For that, you can use a group (using parentheses). We are already using groups though, for the | alternatives, so we need to add ?: to make them non-capturing:

^,*(?:|\d+|"\d+,\d+"),(?:|\d+|"\d+,\d+"),(?:|\d+|"\d+,\d+"),*$

Now, you can wrap the part you care about in a capturing group, giving us the final regexp:

^,*((?:|\d+|"\d+,\d+"),(?:|\d+|"\d+,\d+"),(?:|\d+|"\d+,\d+")),*$

Search & Replace, in regexp mode, with this regexp for the search and \1 for the replacement value.

There can be multiple possible solutions if multiple fields are empty. If the first and last of your three values is never empty, you can adjust the regexp accordingly. If they are, this will pick one set of valid values, which may not be the one you want. For example, if you have:

,,1,"2,3",,

Then both ,1,"2,3 (empty, 1, 2.3) and 1,"2,3", (1, 2.3, empty) would be possible solutions. By default you will get the second result, because the * is greedy, i.e. the first ,* will try to "eat" as many commas as it can without making the match fail. You can add a question mark like this: ,*? to make the match "lazy" or "non-greedy", then it will only take as many commas as it needs to (yielding the first result).

Upvotes: 4

Related Questions