Venkataraman R
Venkataraman R

Reputation: 12959

Having multiple double quotes inside quoted string csv file

I am having a csv file, quotes around each field.

there are some fields, which can have multiple double quotes inside them. I want to escape each of them with additional double quote.

","ABC "XYZ" PQRS","
","ABC "XYZ"","
","ABC "A" "B" TEST","
","ABC 2.5" "C" Test","

I took help from the link and able to cover for scenarios with single double quote inside content, using regular expression [regex]$r='(","[^"]+"[^"]+?",")'. But, stuck up in the cases, where there are multiple double quotes inside content.

[regex]$r='(","[^"]+"[^"]+"",")' # Not working
get-content C:\Projects\MyProject\testRegexFordoublequotes.csv | foreach {

  #save each line to a variable to make it easier to track

  $line=$_

  #look for a regex match

  $find=$r.matches($line)
  
  if ($find[0].Success) { 

      foreach ($match in $find) {

        #the original string we matched on

        $found=$match.value

        #replace the substring

        $replace= '","'+  $found.Trim('","').Replace('""','"').Replace('"','""')+ '","'

        #replace the full string and write to the pipeline

        $line -replace $found,$replace

      } #foreach
       

  } #if

  else {

        #no match so write the line to pipeline

        $line

    }

 } | Set-Content C:\Projects\MyProject\modified.csv -Force

Can you please help me in defining regex which will be helpful for multiple double quotes inside field.

Upvotes: 0

Views: 3246

Answers (3)

iRon
iRon

Reputation: 23623

It is probably easier to search for the valid delimiters (e.g. "\s*,\s*") and split your lines into fields, and than simply correct each (invalid) single double quote with 2 quotes in each field.
Than rebuild the fields to a record by surrounding the fields with double quotes and join them with the csv (comma) delimiter

Input

$Csv = @'
"Field","ABC "XYZ" PQRS","Field"
"Field","ABC "XYZ"","Field"
"Field","ABC "A" "B" TEST","Field"
"Field","ABC 2.5" "C" Test","Field"
'@ -Split '[\r\n]+'

Script

$Csv | # replace with: get-content .\testRegexFordoublequotes.csv |
Foreach-Object {
    $Line = $_ -Replace '^\s*"' -Replace '"\s*$' # Strip outer double quotes
    $Fields = $Line -Split '"\s*,\s*"'           # Split line into fields
    $Fields = $Fields -Replace '"', '""'         # Escape each " in each field
    '"' + ($Fields -Join '","') + '"'            # Rejoin the fields to line
} # append: | Set-Content .\modified.csv -Force

Output

"Field","ABC ""XYZ"" PQRS","Field"
"Field","ABC ""XYZ""","Field"
"Field","ABC ""A"" ""B"" TEST","Field"
"Field","ABC 2.5"" ""C"" Test","Field"

Upvotes: 2

AdminOfThings
AdminOfThings

Reputation: 25001

You may do the following to see what the changes will be:

(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""'

You can simply pipe to Set-Content to save the new contents:

(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""' |
    Set-Content file.csv

Explanation:

(?<!^|",) is a negative lookbehind look for any previous position that is not the beginning of a line (^) or ",. (?!,"|$) is a negative lookahead for any next position that is not the end of a line ($) or ,". If those look around conditions are met, the " is replaced with "".

Upvotes: 1

Peter Thoeny
Peter Thoeny

Reputation: 7616

Based on the conversation we had in the comments of the post, the files are non-conformant CSV files, hence a CSV parser is of no help.

Mind you, you have an undefined case if a single cell happens to have a some textext","more text. That cell will be treated as two cells because of the unescaped quotes.

Now to the regex. You could find a regex with lookahead and lookbehind, but I think it is easier to blindly double all quotes, then cleanup the unintended ones, e.g. at the start & end of line, and in between cells.

I am not familiar with powershell, but here is a JavaScript/pseudo code that you can easily convert to the powershell syntax. I am using a single line containing all your stated test cases; you would loop through the lines in your file:

/* assume $line is:
"Start","ABC "XYZ" PQRS","ABC "XYZ"","ABC "A" "B" TEST","ABC 2.5" "C" Test","End"
*/

$fixed = $line.replace(/"/g, '""')
              .replace(/"",""/g, '","')
              .replace(/^""/, '"')
              .replace(/""$/, '"')

/* $fixed is:
"Start","ABC ""XYZ"" PQRS","ABC ""XYZ""","ABC ""A"" ""B"" TEST","ABC 2.5"" ""C"" Test","End"
*/

Explanation:

  • .replace(/"/g, '""') - blindly double all quotes
  • .replace(/"",""/g, '","') - restore "","" back to ","
  • .replace(/^""/, '"') - restore "" at start of line to "
  • .replace(/""$/, '"') - restore "" at end of line to "

Upvotes: 1

Related Questions