Reputation: 97
I have content of input file in linux, something like
"test1",2,2,,0,"Sun,day",Jan
"be,st3",1,0,,0,"Sunday",Feb1
"nest",0,0,,0,"Sunday",Jul
"rest,5",,,,0,"Sunday",Aug
Need Output like
test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug
used tr command to replace the , with | it replaces the fieldvalue as well. I am not able to understand how to only change the delimiter not the values. can someone please provide some pointers how and which commands i can use to perform this task ?
Upvotes: 3
Views: 687
Reputation: 113834
Using sed:
$ sed -E ':a; s/^(([^"]*("[^"]*")?)*),/\1|/; ta; s/"//g' file
test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug
This changes ,
to |
if the ,
occurs after an even number of "
.
:a
This defines a label a
.
s/^(([^"]*("[^"]*")?)*),/\1|/
Starting from the beginning of the line, ^
, this looks for any number of the following sequence:
1. `[^"]*` = zero or more non-quotes
2. `("[^"]*")?` = pairs of quotes
ta
If the previous s
command resulted in a successful substitution, then jump back to label a
and try again.
s/"//g
After we have replaced all the unquoted commas with vertical bars, we remove the quotes.
As potong points out, an alternative and simpler solution is:
sed -E 's/(([^,"]*("[^"]*")*)*),/\1|/g;s/"//g' file
This works because of two subtleties: (1) sed's regexes look for the left-most longest match, and (2) when doing a global (g) substitution, a subsequent match is not allowed to overlap the prior match. With these two rules in mind, this command replaces ,
with |
only after even numbers of quotes.
Upvotes: 3
Reputation: 140168
That's a difficult task to achieve with sed
.
In python (version 3.x), on the other hand, done with a few lines:
import csv
with open("input.csv") as fr:
with open("output.csv","w",newline='') as fw: # uncomment for python 3.x
with open("output.csv","wb") as fw: # python 2.x only
cr = csv.reader(fr,delimiter=",")
cw = csv.writer(fw,delimiter="|")
cw.writerows(cr)
How it works: It just uses the amazing built-in csv
module. read with a separator, write with another.
okay, now just for fun ... my sed
solution if you really want to know
create a sedfile like this:
s/"\([^",]\+\)"/\1/g
s/"\([^"]\+\),\([^"]\+\)"/\1%\2/g
s/,/\|/g
s/%/,/g
apply it sed -f sedfile.txt input.csv > output.csv
how it works:
result:
test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug
it's limited to 1 coma per protected field (can be extended to 3 or more...) and fields must not use the %
sign.
Upvotes: 2
Reputation: 23667
$ cat ip.txt
"test1",2,2,,0,"Sun,day",Jan
"be,st3",1,0,,0,"Sunday",Feb1
"nest",0,0,,0,"Sunday",Jul
"rest,5",,,,0,"Sunday",Aug
xyz,"12,234","a,b","abc",893,23,"Sep"
$ perl -pe 's/"(?<a>[^"]+)",|(?<a>[^,]*),/$+{a}|/g' ip.txt
test1|2|2||0|Sun,day|Jan
be,st3|1|0||0|Sunday|Feb1
nest|0|0||0|Sunday|Jul
rest,5||||0|Sunday|Aug
xyz|12,234|a,b|abc|893|23|"Sep"
"(?<a>[^"]+)",
text inside ""
(followed immediately by ,
) is captured inside named group a
(?<a>[^,]*),
text other than ,
ending with ,
is also captured inside named group a
%+
"(?<a>[^"]+)",
is first finished before the alternate regex is evaluated"Sep"
, the quotes won't be removed
Further Reading: perlre - search for named capture
Upvotes: 1