Reputation: 13
I've been searching and browsing Stack Overflow RegEx solutions until I'm bug-eyed.
I have a third-party solution that delivers a "tab delimited text file", but I have determined that whatever generates this file is embedding double-quotes inside fields (without even escaping them). I wonder if it's possible to scrub these errors out of the file with RegEx (I am using FNR on the file prior to import).
Each row of data contains 14 columns, tab-delimited, with double-quotes around each field as expected. All of the defects occur in field 2 or field 10 (never the first or last field). These fields hold training course names, some instructors put double-quotes in the name itself, and the third-party report doesn't escape them. I am hoping to anchor on the TAB before and after the correct quotation marks, and then either strip the erroneous internal double-quotes or escape them properly with \".
"......" "ADC000000" "Being the "Best" you can be" "2F8A776C" "...."
"......" "BBC555555" ""Golden Opportunity"" "8F4C3DEE" "...."
Desired output:
"......" "ADC000000" "Being the \"Best\" you can be" "2F8A776C" "...."
"......" "BBC555555" "\"Golden Opportunity\"" "8F4C3DEE" "...."
or (whichever is easiest and reasonably fast; the files have 220,000 rows in them and only 16-50 errors)
"......" "ADC000000" "Being the Best you can be" "2F8A776C" "...."
"......" "BBC555555" "Golden Opportunity" "8F4C3DEE" "...."
Sorry about the verbosity, but I wanted the problem to be as clear as possible.
Upvotes: 1
Views: 352
Reputation:
Split on the tabs then strip off leading and trailing quotes:
line.split('\t').map(function(field) { return field.replace(/^"|"$/g, ''); })
In general, it seems people are trying to do too much with regexps when the job could be done more easily with other approaches, such as splitting and scanning.
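The split-and-scan idea can be taken one step further to produce the escaped output the question asks for. The following is only a sketch: the function names `escapeField` and `fixLine` are made up for illustration, and it assumes that no field contains a tab and that none of the inner quotes are already escaped.

```javascript
// Hypothetical helpers, not part of the answer above.
// Assumes: fields never contain a tab, and inner quotes are not already escaped.
function escapeField(field) {
  // Only touch fields that are wrapped in delimiting quotes.
  if (field.length >= 2 && field.startsWith('"') && field.endsWith('"')) {
    const inner = field.slice(1, -1);           // drop the delimiting quotes
    return '"' + inner.replace(/"/g, '\\"') + '"'; // escape what is left inside
  }
  return field;
}

function fixLine(line) {
  return line.split('\t').map(escapeField).join('\t');
}

const sample = '"x"\t"Being the "Best" you can be"\t""Golden Opportunity""\t"y"';
console.log(fixLine(sample));
```

To drop the stray quotes instead of escaping them, replace `'\\"'` with `''` in `escapeField`.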
Upvotes: 0
Reputation: 174696
You could use the regex below to match any " that is not preceded by a tab or the start of the line and not followed by a tab or the end of the line.
(?<!\t|^)"(?!\t|$)
Then replace the matched " with \\".
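As a rough illustration, here is how that pattern might be applied in JavaScript, one line at a time. The function name `escapeInnerQuotes` is made up for this sketch. It assumes an engine with lookbehind support (such as a recent Node.js) and that each line has had its line ending stripped, because without the `m` flag `^` and `$` only match at the start and end of the string.

```javascript
// A quote that is not next to a tab and not at the start or end of the line.
const innerQuote = /(?<!\t|^)"(?!\t|$)/g;

// Hypothetical helper: expects a single line without its trailing \r or \n.
function escapeInnerQuotes(line) {
  // In a JavaScript string literal, '\\"' is a backslash followed by a quote.
  return line.replace(innerQuote, '\\"');
}

const sample = '"a"\t"Being the "Best" you can be"\t""Golden Opportunity""\t"z"';
console.log(escapeInnerQuotes(sample));
```

Some engines reject alternatives of different lengths inside a lookbehind; in that case two separate lookbehinds, `(?<!\t)(?<!^)`, express the same condition.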
Upvotes: 1
Reputation: 57854
You can match any quote that is both preceded and followed by characters that are not tabs:
s/([^\t])"([^\t])/$1\\"$2/g
(The $1 and $2 put the matched preceding and following character back in the substituted string. Exact syntax may vary depending on your regex engine.)
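If your regex engine supports it, you can use lookbehind and lookahead to make it a bit simpler. Note that, as written, this version also matches the quote at the very start and the very end of each line, since neither has a tab next to it, so add start- and end-of-line checks if the first and last fields are quoted too: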
s/(?<!\t)"(?!\t)/\\"/g
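The two forms do not behave identically when stray quotes sit next to each other, because the capture-group form consumes the neighbouring character. The sketch below compares them in JavaScript. The function names are made up, and the lookaround variant has start- and end-of-line checks added, which is an assumption on top of the pattern shown above.

```javascript
// Capture-group form: the characters on both sides are consumed by the match.
function withCaptureGroups(line) {
  return line.replace(/([^\t])"([^\t])/g, '$1\\"$2');
}

// Lookaround form, with ^ and $ checks added so the quotes that open the
// first field and close the last field are left alone (added assumption).
function withLookarounds(line) {
  return line.replace(/(?<!\t)(?<!^)"(?!\t)(?!$)/g, '\\"');
}

// One line, no trailing newline, with doubled stray quotes in field 2.
const sample = '"a"\t"say ""hi"" now"\t"z"';
console.log(withCaptureGroups(sample)); // second quote of each pair stays unescaped
console.log(withLookarounds(sample));   // all four inner quotes are escaped
```

With the capture-group form, a second pass over the result would turn the remaining bare quotes into `\"` as well, but it would also match the quotes that are already escaped, so the lookaround form is the safer choice where it is available.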
Upvotes: 0