gospes

Reputation: 3937

Replace column value in CSV file with (g)awk when strings contain the delimiter

I'm using gawk 4.0.1 and I know how to replace a column value in a CSV file, for example:

> ROW='1,2,3,4,5,6'
> echo $ROW | gawk -F, -vOFS=, '$2="X"'
1,X,3,4,5,6

However, I am dealing with a file which has strings containing the delimiter. Reading a column goes just fine, but when replacing a value, an extra delimiter is inserted:

> ROW='1,"2,3",4,5,6'
> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '{print $2}'
"2,3"
> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '$2="X"'
1,X,,4,5,6

This is what I was expecting:

> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '$2="X"'
1,X,4,5,6

The value '"2,3"' is replaced by 'X,'. How can I solve this?

EDIT: I did not mention that I have empty fields as well. A better example row would therefore be:

ROW='1,,"2,3",4,5,6'

EDIT 2: From Dawg's answer I gather it's not possible in pure awk. Although I agree a solution with Python is better, the only solution with awk is to add some pre- and post-processing to deal with the empty fields:

#!/bin/bash
ROW='1,,"2,3",4,"",5'
for col in {1..6}; do
    echo $ROW |
        sed 's:,,:, ,:' |                # pre-process: pad the empty field with a space
        gawk -v c=$col -v OFS=, -v FPAT='([^,]+)|("[^\"]*")' '$c="X"' |
        sed 's:, ,:,,:g'                 # post-process: strip the padding again
done

Output:

X,,"2,3",4,"",5
1,X,"2,3",4,"",5
1,,X,4,"",5
1,,"2,3",X,"",5
1,,"2,3",4,X,5
1,,"2,3",4,"",X

Upvotes: 3

Views: 1053

Answers (3)

dawg

Reputation: 103774

$ echo $ROW | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,5,6

I used the pattern from the GNU Awk Manual, section 4.7 Defining Fields by Content.

Compare with the * in the same pattern:

$ echo $ROW | awk -vOFS=, -vFPAT="([^,]*)|(\"[^\"]*\")" '$2="X"'
1,X,,4,5,6

So the answer -- for this limited example -- is to use -vFPAT="([^,]+)|(\"[^\"]+\")", but that does not work with empty fields, as in 1,"2,3",4,,"","should be 6th field"

Here is the result with both kinds of empty fields (,, and ""):

$ echo $ROW2 | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,"","should be 6th field"
      ^^                       - missing the ',,' field
         ^^^                   - now the 5th field  -- BUG!

By convention, ROW2 should be treated as having 6 fields, with the blank fields ,, and "" each counting as one field. If you do not count blank fields as fields, you lose track of which field is which after the blanks. Add that to the list of complications of handling CSV with an awk regex.

Be aware that CSV is surprisingly complicated, and handling the many possibilities is not trivial with awk or a regex alone.

Another option is to use Perl or Python with their more sophisticated, standardized CSV libraries. In Python's case, the csv module is part of the standard distribution.

Here is a Python solution that is fully compatible with RFC 4180:

$ echo $ROW | python -c '
> import csv, fileinput
> for line in csv.reader(fileinput.input()):
>     print ",".join(e if i!=1 else "X" for i, e in enumerate(line))'
1,X,4,5,6

This allows more complicated CSV to be handled easily.

Here is a 4-record by 5-field CSV with CRLFs inside the quoted fields, escaped quotes inside the quoted fields, and both kinds of blank fields (,, and ""):

1,"2,3",4,5,6
"11,12",13,14,15,16
21,"22,
23",24,25,"26
27"
31,,"33\"not 32\"","",35

With that same script (using repr to see the complete field values, but you would probably use str in normal circumstances) all these cases are handled correctly according to RFC 4180:

$ cat /tmp/3.csv | python -c '
import csv, fileinput
for line in csv.reader(fileinput.input()):
   print ",".join(repr(e) if i!=1 else "X" for i, e in enumerate(line))'
'1',X,'4','5','6'
'11,12',X,'14','15','16'
'21',X,'24','25','26\n27'
'31',X,'33\\not 32\\""','','35'

This is difficult with awk: \n delimits each record, empty fields are not handled correctly, and escaped quotes are not treated correctly:

$ cat /tmp/3.csv | awk -vOFS=, -vFPAT='[^,]+|"[^"]*"' '$2="X"'
1,X,4,5,6
"11,12",X,14,15,16
21,X
23",X,25,"26
27",X
31,X,"",35

Now you would need to redefine RS to a regex that finds quotes around CR and read multiple lines with awk... Add support for escaped quotes... Do a more sophisticated regex to split the fields... Complicated... Good luck!
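By comparison, here is a short Python 3 sketch (mine, not from the original answer; it assumes strict RFC 4180 doubled-quote escaping rather than the backslash escapes used in the file above) showing that csv.reader already reassembles records whose quoted fields span lines, and that csv.writer re-quotes and doubles embedded quotes on the way out:

import csv
import io

# Hypothetical sample in RFC 4180 form: the first record has a quoted field
# spanning two lines, the second has a doubled ("") quote inside a field.
sample = '21,"22,\n23",24\r\n31,"33""not 32""",35\r\n'

rows = list(csv.reader(io.StringIO(sample)))
# rows[0] == ['21', '22,\n23', '24']    -- multi-line record reassembled
# rows[1] == ['31', '33"not 32"', '35'] -- doubled quote unescaped

out = io.StringIO()
writer = csv.writer(out)
for row in rows:
    row[1] = "X"               # replace the second field of each logical record
    writer.writerow(row)

print(out.getvalue(), end="")  # prints: 21,X,24  then  31,X,35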

Upvotes: 2

karthick Sundaram

Reputation: 33

  1. $ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"]."' '$2="X"'

     The . should be placed after [^"].

  2. $ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"].*"' '$2="X"'

These two commands produce the output 1,X,4,5,6 for ROW='1,"2,3",4,5,6'.

Upvotes: 0

karthick Sundaram

Reputation: 33

This is the output for:

$ ROW='1,"2,3",4,5,6' 
$ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"].*"' '$2="X"'
1,X,4,5,6

Both of these commands work fine; the * was simply dropped while pasting one of them here.

perl:

$var='1,"2,3",4,5,6';
$var=~s/\".*\"/X/g;
print $var;

Upvotes: 0
