Reputation: 41
I have a file with a bunch of CSV lines, with values with and without quotes, like so:
"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD",
The fifth column is a text column which may contain escaped or unescaped double quotes. I am trying to get rid of all the stray quotes inside this column so that it looks like this:
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
Any ideas how to achieve this using SED or AWK, or any other unix tools? Much appreciated!
Upvotes: 0
Views: 707
Reputation: 625
Try this regex:
,\d{2}\,(.*),\"\S{25}\",\"\w{3}"
It was made based on your examples. The goal is just to capture the fifth column. As @Jerry Jeremiah suggested, the point is to use the date, which will always be 25 characters long. To prevent mismatches I've also taken into account the two digits present before the fifth column and the three letters/digits after the date. Regex101v1
We can also use a "stronger" regex by matching the exact date format:
,\d{2}\,(.*),\"\d{4}-\d{2}-\d{2}\w\d{2}:\d{2}:\d{2}-\d{2}:\d{2}\",\"\w{3}"
With these regexes you'll be able to extract the fifth column using the capture group. To go further, you can do this in bash:
regex='^(.*,[0-9]{2}\,")(.*)(",\"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}-[0-9]{2}:[0-9]{2}\",\"[a-zA-Z]{3}".*$)'

while IFS= read -r line
do
    if [[ $line =~ $regex ]]
    then
        before=${BASH_REMATCH[1]}
        fifth=${BASH_REMATCH[2]}
        after=${BASH_REMATCH[3]}
        reworked_fifth="${fifth//\"}"
        echo "${before}${reworked_fifth}${after}"
    else
        echo "Line didn't match the regex"
    fi
done < /my/file/path
I had to change the regex since my bash didn't accept \d and \w. No need to sed or awk anything with this; Bash can handle it alone.
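That said, since the question also asks about sed: the same date-anchoring idea works as a sed substitution loop that deletes one stray quote per pass until only the enclosing quotes of the fifth column remain. This is only a sketch against the sample lines; the date pattern and the three-character final column are assumptions carried over from the examples:

```shell
# Delete one quote per pass from inside the 5th column. The trailing
# "date","XXX", columns anchor the match; ta loops until no substitution
# happens (i.e. only the enclosing quotes are left).
sed -E ':a
s/^(([^,]*,){4}"[^"]*)"(.*",("[0-9]{4}[^"]*","[[:alnum:]]{3}",))$/\1\3/
ta' <<'EOF'
"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD",
EOF
```

Replace the here-document with your real file to use it; the `.*",` part of the guard keeps the closing quote of the fifth column from being deleted.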
Upvotes: 0
Reputation: 5006
With awk, you can do something like this that avoids a very complex regex. The facts that only the fifth column is broken, that the previous columns do not contain commas, and that there is a known, fixed number of columns make it easy to repair:
Edited to use gsub for portability, as suggested by Ed Morton.
awk '
BEGIN { FS = OFS = "," }
{
    # glue the pieces of the broken 5th column back together
    for (i = 6; i <= NF-3; i++) {
        $5 = $5 FS $i
    }
}
{
    gsub(/"/, "", $5)
}
{ print $1, $2, $3, $4, "\"" $5 "\"", $(NF-2), $(NF-1), $NF }
' <file>
Output:
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
If you want to escape the quotes instead, you can use this:
awk '
BEGIN { FS = OFS = "," }
{
    for (i = 6; i <= NF-3; i++) {
        $5 = $5 FS $i
    }
}
{
    gsub(/^"|"$/, "", $5)
    gsub(/"/, "\\\"", $5)
    $5 = "\"" $5 "\""
}
{ print $1, $2, $3, $4, $5, $(NF-2), $(NF-1), $NF }
' <file>
Output:
"123","456",,17,"hello,\" how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a \"meeting\", unprepared while trying to be \"awake","2018-05-29T18:58:10-05:00","ACD",
Upvotes: 2
Reputation: 203209
With GNU awk for the 3rd arg to match() and assuming you know how many fields there should be in each line:
$ cat tst.awk
BEGIN {
    numFlds = 8
    badFldNr = 5
}
match($0,"^(([^,]*,){"badFldNr-1"})(.*)((,[^,]*){"numFlds-badFldNr"})",a) {
    gsub(/"/,"",a[3])
    print a[1] "\"" a[3] "\"" a[4]
}
$ awk -f tst.awk file
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
With other awks you can do the same with a couple of calls to match() and variables instead of the array.
Upvotes: 0
Reputation: 26471
Your question is very difficult to answer in a generic way. To give an example:
"a","b","c","d"
How should this be interpreted (if we remove the quotes from the fields of interest)?
"a","b","c","d" (4 fields)
"a,b","c","d" (3 fields, $1 messed up)
"a","b,c","d" (3 fields, $2 messed up)
"a","b","c,d" (3 fields, $3 messed up)
"a,b,c","d" (2 fields, $1 messed up)
"a,b","c,d" (2 fields, $1 and $2 messed up)
"a","b,c,d" (2 fields, $2 messed up)
"a,b,c,d" (1 field , $1 messed up)
The only way this can be solved is by having the following knowledge: which field is broken (n below) and how many fields each record should have (nf below).
The following awk program will help you fix it:
$ awk 'BEGIN{ ere = "[^,]*|\042[^\042]*\042" }
{ head=tail=""; mid=$0 }
# extract the head which is correct
(n>1) {
    ere_h = "^"
    for (i=1; i<n; ++i) ere_h = ere_h (ere_h=="^" ? "" : ",") "(" ere ")"
    match(mid, ere_h); head = substr(mid,RSTART,RLENGTH)
    mid = substr(mid,RLENGTH+1)
}
# extract the tail which is correct
(nf>n) {
    ere_t = "$"
    for (i=n+1; i<=nf; ++i) ere_t = "(" ere ")" (ere_t=="$" ? "" : ",") ere_t
    match(mid, ere_t); tail = substr(mid,RSTART,RLENGTH)
    mid = substr(mid,1,RSTART-1)
}
# correct the mid part
{
    gsub(/\042/,"",mid)
    mid = (mid ~ /^,/) ? ( ",\042" substr(mid,2) ) : ( "\042" mid )
    mid = (mid ~ /,$/) ? ( substr(mid,1,length(mid)-1) "\042," ) : ( mid "\042" )
}
# print the stuff
{ print head mid tail }' n=5 nf=7 file
Upvotes: 0