Bill Butler

Reputation: 489

Using sed to clean a CSV file

First of all, I'd like to say that I have searched exhaustively for this solution. It is important that I use sed or at least a mix of *nix command line utilities to solve this. I'm dealing with, in some cases, malformed CSV files, but I'm pretty sure it's solvable. I'm missing just one piece of the puzzle.

I'd like to build a converter from CSV to pipe. It should solve the following issues:

  1. Strip out "," and replace with |
  2. Strip out ", and replace with |
  3. Strip out ," and replace with |
  4. Strip out quotes inside quotes like: dog,"john "bud" smith",cat (becomes dog|john bud smith|cat)
  5. Strip out , which aren't between quotes and replace with |

I've completed almost all of this with a sed command, but I'm stumped with the commas that are within a field. There is likely a better way but I'm running out of creative thought on the topic. A proper solution will parse this string:

1234,"bill","butler","1000,p"r"airie",1234,6789

into

1234|bill|butler|1000,prairie|1234|6789

This is what I have so far:

echo '1234,"bill","butler","1000,p"r"airie",1234,6789' |
sed -e 's/","/|/g' -e 's/,"/|/g' -e 's/",/|/g' -e 's/"//g'
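
For reference, here is a small sketch of what that command produces on the sample line. The function name attempt is only an illustrative wrapper around the same four substitutions; nothing else is changed.

```shell
#!/bin/sh
# The same four substitutions as above, wrapped in a function
# (the name "attempt" is only for illustration).
attempt() {
    sed -e 's/","/|/g' -e 's/,"/|/g' -e 's/",/|/g' -e 's/"//g'
}

echo '1234,"bill","butler","1000,p"r"airie",1234,6789' | attempt
# prints: 1234|bill|butler|1000,prairie|1234,6789
```

The comma inside the quoted field survives as wanted, but the unquoted comma between 1234 and 6789 (item 5 in the list) is left untouched as well.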

Upvotes: 1

Views: 1248

Answers (3)

Per Boussard

Reputation: 11

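#!/bin/bash

l='1234,"bill","butler","1000,p"r"airie",1234,6789'

# Succeeds if some '"' has a non-comma character on both sides,
# i.e. a quote embedded inside a field.
has_quote_in_quote()
{
    printf '%s\n' "$1" | grep -q '[^,]"[^,]'
}

# Removes embedded quotes; one pass can miss a quote that sits
# one character after another, hence the loop below.
clean_quote_in_quote ()
{
    printf '%s\n' "$1" | sed -E -e 's/([^,])"([^,])/\1\2/g'
}

# Splits the line into quoted and unquoted stretches, one per output line.
parse()
{
    printf '%s\n' "$1" | grep -E -o '[^"]*|"[^"]*"'
}

# Turns commas into pipes on the unquoted stretches only.
pipe_unquoted_commas()
{
    parse "$1" | sed -E -e '/^[^"]/s/,/|/g'
}

while has_quote_in_quote "$l"; do l=$(clean_quote_in_quote "$l"); done

# Strip the remaining quotes and join the pieces back into one line.
pipe_unquoted_commas "$l" | sed 's/"//g' | tr -d '\n'; echo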

Running this yields

1234|bill|butler|1000,prairie|1234|6789

It's not obvious to me that this is what you want, but let me explain how it works.

has_quote_in_quote finds any '"' that is not next to a comma. clean_quote_in_quote removes every such quote it can find in one pass. If two of them are only one character apart, a single pass is not enough: after the first match, sed has already consumed the character between them, so the second quote is not matched until the next pass. Whether by chance or by design, your example was really well chosen. parse picks out either an unquoted or a quoted stretch of text, including the quotes. The quotes inside fields are removed in the while loop; the last line then transforms the commas and removes the remaining quote characters.
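
The repeated-pass idea can also be expressed inside sed itself, using a label and the t command (branch if a substitution succeeded). This is only a sketch and not part of the script above: the function name strip_inner_quotes is made up for illustration, and it makes the same assumption as the script, namely that a quote touching a comma is a field delimiter.

```shell
#!/bin/sh
# Hypothetical helper: remove every '"' that has a non-comma character
# on both sides, looping inside sed until no such quote is left.
strip_inner_quotes() {
    sed -E -e ':again' \
           -e 's/([^,])"([^,])/\1\2/' \
           -e 't again'
}

printf '%s\n' '1234,"bill","butler","1000,p"r"airie",1234,6789' | strip_inner_quotes
# prints: 1234,"bill","butler","1000,prairie",1234,6789
```

Each substitution removes one embedded quote, and sed restarts the search from the beginning of the line, so closely spaced quotes are handled without an outer shell loop.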

//P

Upvotes: 1

devnull

Reputation: 123458

You could use perl. Text::ParseWords to the rescue:

perl -MText::ParseWords -nle 'print join "|", map {s/"//g; $_} parse_line(",",1,$_);' file

For your sample input, it'd produce:

1234|bill|butler|1000,prairie|1234|6789

Upvotes: 3

Nazarii Bardiuk

Reputation: 4342

echo '1234,"bill","butler","1000,p"r"airie",1234,6789' | 
sed -e 's/\([0-9"]\),\([0-9"]\)/\1|\2/g' -e 's/"//g'

I defined a rule:

, is transformed to | if it is between digits or quotes

and afterwards all quotes are simply stripped out.

EDIT 1: It looks like my solution is not working, but there is a nice thread on this question.
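
A few inputs suggest where the rule breaks down. The sketch below wraps the same sed command in a function (the name rule_based is made up here) and runs it on three lines that are not from the question; they were picked only to illustrate the problem.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed command from this answer.
rule_based() {
    sed -e 's/\([0-9"]\),\([0-9"]\)/\1|\2/g' -e 's/"//g'
}

# Single-digit fields: a digit consumed as the right-hand side of one
# match cannot also serve as the left-hand side of the next one.
echo '1,2,3' | rule_based        # prints: 1|2,3

# Unquoted text fields: the comma is not next to a digit or a quote.
echo 'dog,cat' | rule_based      # prints: dog,cat

# A comma between digits inside a quoted field is converted too.
echo '"12,34",5' | rule_based    # prints: 12|34|5
```

The rule only looks at the characters adjacent to each comma, so it cannot tell whether that comma sits inside or outside a quoted field.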

Upvotes: 0
