Hitin
Hitin

Reputation: 442

Regular expression with special character inside CSV with quoted matches

I tried and this is the best I could come up with, need a fresh eye or help to finish this job.

Expression:

\"[a-zA-Z\s0-9\.\']*\"

Input String Data:

BOL,"AWBH0876356","HMM","H0010","BEANR","BEANR","AEJEA","BHBAH","","","T","S","","","F","N","","FCL/FCL","BE","","","","","","","","SUNNYLAND DISTRIBUTION NV","EVERDONGENLAAN 12 2300 TURNHOUT","","","INTERNATIONAL AGENCIES CO LTD","BUILDING 406, ROAD 4308, BLOCK 343, MANAMA BAHRAIN","","INTERNATIONAL AGENCIES CO LTD","BUILDING 406, ROAD 4308, BLOCK 343, MANAMA BAHRAIN","","","","","","","N/A","770000",""SHIPPER'S LOAD & COUNT, SAID TO BE:" 1X20'DC CONTAINER S.T.C 1650 CARTON OF JUICES FREIGHT PREPAID","1650","CARTONS","CTN","","","1","1","2.2","21615.0","23815.0","0","0","0","0","","",""

I need to ignore the first word (BOL) and comma, which is working, but I am stuck with matches which have special character (',") in them.

Below match is an issue, for example:

""SHIPPER'S LOAD & COUNT, SAID TO BE:" 1X20'DC CONTAINER S.T.C 1650 CARTON OF JUICES FREIGHT PREPAID"

Upvotes: 0

Views: 509

Answers (2)

Jerry
Jerry

Reputation: 71538

The current problem with your regex (and the string to be parsed) is that it doesn't accept that there is a quote inside a value and the string has. Maybe, you could specify that there can be a quote inside, but then, the closing quote should only be before a comma or at the end of the string, and you can do that with a positive lookhead:

".*?"(?=,|$)

regex101 demo

".*?" matches the value while (?=,|$) ensures there's either a comma after it or the end of the string (indicated by $).

Note that if your string has value containing a quote character followed by a comma, the above regex won't work.

In that case, what I usually do is count the number of matches. If this exceeds the number I expect, I put the original line separately so I can look at them one by one (that will involve some manual intervention, but that's better than ending up with lots of errors!).

If all the problems come from a single column, then you could change your script so that it will 'merge' values from i to j (i being the column number where the first issue occurs and j being the next one) until there's an appropriate number of values.

Upvotes: 2

vks
vks

Reputation: 67968

(?:^|(?<=,))"(?!,")(.+?)"(?=,"|$)

Try this.See demo.

https://regex101.com/r/tJ2mW5/4

Upvotes: 0

Related Questions