Reputation: 12959
I am having a csv file, quotes around each field.
there are some fields, which can have multiple double quotes inside them. I want to escape each of them with additional double quote.
","ABC "XYZ" PQRS","
","ABC "XYZ"","
","ABC "A" "B" TEST","
","ABC 2.5" "C" Test","
I took help from the link and able to cover for scenarios with single double quote inside content, using regular expression [regex]$r='(","[^"]+"[^"]+?",")'
. But, stuck up in the cases, where there are multiple double quotes inside content.
[regex]$r='(","[^"]+"[^"]+"",")' # Not working
get-content C:\Projects\MyProject\testRegexFordoublequotes.csv | foreach {
#save each line to a variable to make it easier to track
$line=$_
#look for a regex match
$find=$r.matches($line)
if ($find[0].Success) {
foreach ($match in $find) {
#the original string we matched on
$found=$match.value
#replace the substring
$replace= '","'+ $found.Trim('","').Replace('""','"').Replace('"','""')+ '","'
#replace the full string and write to the pipeline
$line -replace $found,$replace
} #foreach
} #if
else {
#no match so write the line to pipeline
$line
}
} | Set-Content C:\Projects\MyProject\modified.csv -Force
Can you please help me in defining regex which will be helpful for multiple double quotes inside field.
Upvotes: 0
Views: 3246
Reputation: 23623
It is probably easier to search for the valid delimiters (e.g. "\s*,\s*"
) and split your lines into fields, and than simply correct each (invalid) single double quote with 2 quotes in each field.
Than rebuild the fields to a record by surrounding the fields with double quotes and join them with the csv
(comma) delimiter
Input
$Csv = @'
"Field","ABC "XYZ" PQRS","Field"
"Field","ABC "XYZ"","Field"
"Field","ABC "A" "B" TEST","Field"
"Field","ABC 2.5" "C" Test","Field"
'@ -Split '[\r\n]+'
Script
$Csv | # replace with: get-content .\testRegexFordoublequotes.csv |
Foreach-Object {
$Line = $_ -Replace '^\s*"' -Replace '"\s*$' # Strip outer double quotes
$Fields = $Line -Split '"\s*,\s*"' # Split line into fields
$Fields = $Fields -Replace '"', '""' # Escape each " in each field
'"' + ($Fields -Join '","') + '"' # Rejoin the fields to line
} # append: | Set-Content .\modified.csv -Force
Output
"Field","ABC ""XYZ"" PQRS","Field"
"Field","ABC ""XYZ""","Field"
"Field","ABC ""A"" ""B"" TEST","Field"
"Field","ABC 2.5"" ""C"" Test","Field"
Upvotes: 2
Reputation: 25001
You may do the following to see what the changes will be:
(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""'
You can simply pipe to Set-Content
to save the new contents:
(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""' |
Set-Content file.csv
Explanation:
(?<!^|",)
is a negative lookbehind look for any previous position that is not the beginning of a line (^
) or ",
. (?!,"|$)
is a negative lookahead for any next position that is not the end of a line ($
) or ,"
. If those look around conditions are met, the "
is replaced with ""
.
Upvotes: 1
Reputation: 7616
Based on the conversation we had in the comments of the post, the files are non-conformant CSV files, hence a CSV parser is of no help.
Mind you, you have an undefined case if a single cell happens to have a some textext","more text
. That cell will be treated as two cells because of the unescaped quotes.
Now to the regex. You could find a regex with lookahead and lookbehind, but I think it is easier to blindly double all quotes, then cleanup the unintended ones, e.g. at the start & end of line, and in between cells.
I am not familiar with powershell, but here is a JavaScript/pseudo code that you can easily convert to the powershell syntax. I am using a single line containing all your stated test cases; you would loop through the lines in your file:
/* assume $line is:
"Start","ABC "XYZ" PQRS","ABC "XYZ"","ABC "A" "B" TEST","ABC 2.5" "C" Test","End"
*/
$fixed = $line.replace(/"/g, '""')
.replace(/"",""/g, '","')
.replace(/^""/, '"')
.replace(/""$/, '"')
/* $fixed is:
"Start","ABC ""XYZ"" PQRS","ABC ""XYZ""","ABC ""A"" ""B"" TEST","ABC 2.5"" ""C"" Test","End"
*/
Explanation:
.replace(/"/g, '""')
- blindly double all quotes.replace(/"",""/g, '","')
- restore "",""
back to ","
.replace(/^""/, '"')
- restore ""
at start of line to "
.replace(/""$/, '"')
- restore ""
at end of line to "
Upvotes: 1