Reputation: 489
First of all, I'd like to say that I have searched exhaustively for this solution. It is important that I use sed or at least a mix of *nix command line utilities to solve this. I'm dealing with, in some cases, malformed CSV files, but I'm pretty sure it's solvable. I'm missing just one piece of the puzzle.
I'd like to build a converter from CSV to pipe. It should solve the following issues:
Find "," and replace with |
Find ", and replace with |
Find ," and replace with |
Remove quotes that appear inside a field: dog,"john "bud" smith",cat becomes dog|john bud smith|cat
Find , which aren't between quotes and replace with |
I've completed almost all of this with a sed command, but I'm stumped with the commas that are within a field. There is likely a better way, but I'm running out of creative thought on the topic. A proper solution will parse this string:
1234,"bill","butler","1000,p"r"airie",1234,6789
into
1234|bill|butler|1000,prairie|1234|6789
This is what I have so far:
echo '1234,"bill","butler","1000,p"r"airie",1234,6789' |
sed -e 's/","/|/g' -e 's/,"/|/g' -e 's/",/|/g' -e 's/"//g'
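To make the missing piece concrete, here is a small sketch that wraps the command above in a function (the name `attempt` is only a label for this illustration) and runs it on the sample line. The comma inside the quoted field survives as intended, but the comma between the two unquoted trailing fields is never converted:

```shell
# The four substitutions from the attempt above; "attempt" is a hypothetical name.
attempt() {
    sed -e 's/","/|/g' -e 's/,"/|/g' -e 's/",/|/g' -e 's/"//g'
}

printf '%s\n' '1234,"bill","butler","1000,p"r"airie",1234,6789' | attempt
# prints: 1234|bill|butler|1000,prairie|1234,6789
```

So the open problem is telling the unquoted comma in `1234,6789` apart from the quoted one in `1000,prairie`.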
Upvotes: 1
Views: 1248
Reputation: 11
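#!/bin/bash
l='1234,"bill","butler","1000,p"r"airie",1234,6789'
# Succeeds if some quote has a non-comma character on both sides,
# i.e. the quote sits inside a field.
has_quote_in_quote()
{
    printf '%s\n' "$1" | grep -q '[^,]"[^,]'
}
# Removes the in-field quotes it can find in a single pass.
clean_quote_in_quote()
{
    printf '%s\n' "$1" | sed -E -e 's/([^,])"([^,])/\1\2/g'
}
# Splits the line into unquoted and quoted stretches, one per output line.
parse()
{
    printf '%s\n' "$1" | grep -E -o '[^"]*|"[^"]*"'
}
# Turns commas into pipes only on stretches that do not start with a quote.
pipe_unquoted_commas()
{
    parse "$1" | sed -E -e '/^[^"]/s/,/|/g'
}
# Repeat until no in-field quotes are left.
while has_quote_in_quote "$l"; do l=$(clean_quote_in_quote "$l"); done
# Strip the remaining quotes and join the stretches back into one line.
pipe_unquoted_commas "$l" | sed 's/"//g' | tr -d '\n'; echo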
Running this yields
1234|bill|butler|1000,prairie|1234|6789
It's not obvious to me that this is what you want, but let me explain how it works.
has_quote_in_quote finds any " that is not next to a comma. clean_quote_in_quote removes all those it can find, but when two such quotes are only one character apart it needs more than one pass, because sed has already consumed the character between them by the time it reaches the second quote. So, whether by chance or by design, your example was really well chosen. parse picks out either an unquoted or a quoted stretch of text, including the quotes. The "quoted quotes" are removed in the while loop; the last line then transforms the commas and removes the remaining quote characters.
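For comparison, the same two steps can be sketched as a single pipeline. This is only a sketch, not the script above: the function name `csv_to_pipe` is made up for the illustration, the repeated passes are done inside sed with a label and the `t` (branch if a substitution succeeded) command, and the comma step is swapped for a small awk loop that tracks whether it is currently inside quotes. It assumes one record per line and, like the script, treats any quote that has non-comma characters on both sides as a stray in-field quote.

```shell
# Hypothetical helper: reads CSV lines on stdin, writes pipe-delimited lines.
csv_to_pipe() {
    # Step 1: remove one in-field quote per iteration, looping until none match.
    sed -E -e ':a' -e 's/([^,])"([^,])/\1\2/' -e 'ta' |
    # Step 2: walk each line character by character; drop quotes while toggling
    # the in-quotes flag, and turn commas into pipes only outside quotes.
    awk '{
        out = ""; inq = 0
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c == "\"") { inq = !inq; continue }
            if (c == "," && !inq) c = "|"
            out = out c
        }
        print out
    }'
}

printf '%s\n' '1234,"bill","butler","1000,p"r"airie",1234,6789' | csv_to_pipe
# prints: 1234|bill|butler|1000,prairie|1234|6789
```

The sed loop replaces the shell while loop, so the input line never has to be passed around as a shell variable.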
//P
Upvotes: 1
Reputation: 123458
You could use perl. Text::ParseWords to the rescue:
perl -MText::ParseWords -nle 'print join "|", map {s/"//g; $_} parse_line(",",1,$_);' file
For your sample input, it'd produce:
1234|bill|butler|1000,prairie|1234|6789
Upvotes: 3
Reputation: 4342
echo '1234,"bill","butler","1000,p"r"airie",1234,6789' |
sed -e 's/\([0-9"]\),\([0-9"]\)/\1|\2/g' -e 's/"//g'
I defined a rule: a , is transformed to | if it sits between digits or quotes, and afterwards all quotes are simply stripped out.
EDIT1: Looks like my solution is not working, but there is a nice thread for this question.
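A short sketch of where the rule breaks down, using the same sed expressions wrapped in a function (the name `digit_or_quote_rule` is just a label for this demo). Commas next to unquoted letters are never converted, and in a run of short numeric fields the digit consumed by one match is not available as the left neighbour of the next comma:

```shell
# The rule from this answer; "digit_or_quote_rule" is a hypothetical name.
digit_or_quote_rule() {
    sed -e 's/\([0-9"]\),\([0-9"]\)/\1|\2/g' -e 's/"//g'
}

printf '%s\n' 'dog,"john "bud" smith",cat' | digit_or_quote_rule
# prints: dog,john bud smith,cat

printf '%s\n' '1,2,3' | digit_or_quote_rule
# prints: 1|2,3
```

The sample line from the question happens to avoid both cases, which is why it still comes out right.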
Upvotes: 0