user1479571
user1479571

Reputation: 97

how to replace delimiter when delimiter also present in field data at any place in a file?

I have content of input file in linux, something like

"test1",2,2,,0,"Sun,day",Jan
"be,st3",1,0,,0,"Sunday",Feb1
"nest",0,0,,0,"Sunday",Jul
"rest,5",,,,0,"Sunday",Aug

Need Output like

test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug

used tr command to replace the , with | it replaces the fieldvalue as well. I am not able to understand how to only change the delimiter not the values. can someone please provide some pointers how and which commands i can use to perform this task ?

Upvotes: 3

Views: 687

Answers (3)

John1024
John1024

Reputation: 113834

Using sed:

$ sed -E ':a; s/^(([^"]*("[^"]*")?)*),/\1|/; ta; s/"//g' file
test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug

How it works

This changes , to | if the , occurs after an even number of ".

  • :a

    This defines a label a.

  • s/^(([^"]*("[^"]*")?)*),/\1|/

    Starting from the beginning of the line, ^, this looks for any number of the following sequence:

    1. `[^"]*`  = zero or more non-quotes
    
    2. `("[^"]*")?`  = pairs of quotes
    
  • ta

    If the previous s command resulted in a successful substitution, then jump back to label a and try again.

  • s/"//g

    After we have replaced all the unquoted commas with vertical bars, we remove the quotes.

Simplification

As potong points out, an alternative and simpler solution is:

sed -E 's/(([^,"]*("[^"]*")*)*),/\1|/g;s/"//g' file

This works because of two subtleties: (1) sed's regexes look for the left-most longest match, and (2) when doing a global (g) substitution, a subsequent match is not allowed to overlap the prior match. With these two rules in mind, this command replaces , with | only after even numbers of quotes.

Upvotes: 3

Jean-François Fabre
Jean-François Fabre

Reputation: 140168

That's a difficult task to achieve with sed.

In python (version 3.x), on the other hand, done with a few lines:

import csv

with open("input.csv") as fr:
    with open("output.csv","w",newline='') as fw:  # uncomment for python 3.x
    with open("output.csv","wb") as fw:            # python 2.x only
        cr = csv.reader(fr,delimiter=",")
        cw = csv.writer(fw,delimiter="|")
        cw.writerows(cr)

How it works: It just uses the amazing built-in csv module. read with a separator, write with another.

okay, now just for fun ... my sed solution if you really want to know

create a sedfile like this:

s/"\([^",]\+\)"/\1/g
s/"\([^"]\+\),\([^"]\+\)"/\1%\2/g
s/,/\|/g
s/%/,/g

apply it sed -f sedfile.txt input.csv > output.csv

how it works:

  • strips the quotes from the simple fields
  • strips the quotes from the fields containing commas, but replace commas by percent signs
  • change the separator
  • change back percent signs by comas

result:

test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug

it's limited to 1 coma per protected field (can be extended to 3 or more...) and fields must not use the % sign.

Upvotes: 2

Sundeep
Sundeep

Reputation: 23667

$ cat ip.txt 
"test1",2,2,,0,"Sun,day",Jan
"be,st3",1,0,,0,"Sunday",Feb1
"nest",0,0,,0,"Sunday",Jul
"rest,5",,,,0,"Sunday",Aug
xyz,"12,234","a,b","abc",893,23,"Sep"

$ perl -pe 's/"(?<a>[^"]+)",|(?<a>[^,]*),/$+{a}|/g' ip.txt 
test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug
xyz|12,234|a,b|abc|893|23|"Sep"
  • "(?<a>[^"]+)", text inside "" (followed immediately by ,) is captured inside named group a
  • (?<a>[^,]*), text other than , ending with , is also captured inside named group a
  • The named group is accessible via the hash %+
  • The order is important - "(?<a>[^"]+)", is first finished before the alternate regex is evaluated
  • Note: If final element is quoted, say "Sep", the quotes won't be removed


Further Reading: perlre - search for named capture

Upvotes: 1

Related Questions