MylesXD

Reputation: 25

Remove comma in unexpected column within csv [awk,sed]

Is there a simple way to remove a comma within a CSV that isn't supposed to be there, using awk or sed?

The .csv is made up of 5 columns, but an additional comma makes some rows come out with 6 columns. In the example below, I expect the third column to be a number, but in rows 3 and 4 it is a string.

4,abc,323,123,acvd
4,abc,323,123,acvd
4,ab,c,323,123,acvd
4,db,c,323,123,acvd

I don't want to over-engineer the solution, which I could probably do with pandas / Python.

It is also a very large file (5 GB, over a million rows), so it isn't something I can open in Excel.

The desired output would either merge the extra string into the second column or remove it, i.e. either of the following:

4,abc,323,123,acvd
4,abc,323,123,acvd
4,abc,323,123,acvd
4,dbc,323,123,acvd

4,abc,323,123,acvd
4,abc,323,123,acvd
4,ab,323,123,acvd
4,db,323,123,acvd

Upvotes: 1

Views: 729

Answers (3)

Ed Morton

Reputation: 203209

Since removing the extra field is an acceptable option for you:

$ awk 'BEGIN{FS=OFS=","} {print $1, $2, $(NF-2), $(NF-1), $NF}' file
4,abc,323,123,acvd
4,abc,323,123,acvd
4,ab,323,123,acvd
4,db,323,123,acvd

otherwise:

$ awk 'BEGIN{FS=OFS=","} {print $1, $2 (NF>5 ? $3 : ""), $(NF-2), $(NF-1), $NF}' file
4,abc,323,123,acvd
4,abc,323,123,acvd
4,abc,323,123,acvd
4,dbc,323,123,acvd
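
If a row could contain more than one stray comma, the second command would keep only the first extra piece. A minimal sketch of a generalization follows, assuming the stray commas always fall inside column 2 and the last three columns are always intact. The function name merge_extra_fields is hypothetical and not part of the answer above.

```shell
# merge_extra_fields: hypothetical helper, not from the original answer.
# Assumption: stray commas only occur inside column 2, and the last
# three columns of every row are intact.
merge_extra_fields() {
    awk 'BEGIN { FS = OFS = "," }
    NF > 5 {
        merged = $2
        # Glue fields 3 .. NF-3 back onto field 2, dropping the stray commas.
        for (i = 3; i <= NF - 3; i++) merged = merged $i
        print $1, merged, $(NF-2), $(NF-1), $NF
        next
    }
    { print }' "$@"
}

# Demo on a row with two stray commas.
printf '4,a,b,c,323,123,acvd\n' | merge_extra_fields
```

The function reads standard input when called without arguments, or a file when one is given, e.g. merge_extra_fields file > fixed.csv.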

Upvotes: 5

anubhava

Reputation: 784958

This simple awk script would also do your job:

awk 'BEGIN {FS=OFS=","} NF==6 {$2 = $2 $3; $3=""; sub(/,,/, ",")} 1' file.csv

4,abc,323,123,acvd
4,abc,323,123,acvd
4,abc,323,123,acvd
4,dbc,323,123,acvd
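
On a file this large it may be worth checking how many rows have each field count, both before and after applying a fix. The sketch below is one way to do that. The helper name field_counts is hypothetical, and it assumes plain comma-separated rows with no quoted fields.

```shell
# field_counts: hypothetical helper that prints "<fields> <rows>" pairs,
# sorted by field count. Assumes no quoted fields containing commas.
field_counts() {
    awk -F',' '{ count[NF]++ } END { for (n in count) print n, count[n] }' "$@" |
        sort -n
}

# Demo: one well-formed row and one row with a stray comma.
printf '4,abc,323,123,acvd\n4,ab,c,323,123,acvd\n' | field_counts
```

After a successful fix, running field_counts on the output file should report a single line for 5 fields.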

Upvotes: 3

Timur Shtatland

Reputation: 12347

Use this Perl one-liner:

perl -F',' -lane 'if ( @F > 5 ) { $_ = join ",", $F[0], "$F[1]$F[2]", @F[3..$#F]; } print;' in_file

Output:

4,abc,323,123,acvd
4,abc,323,123,acvd
4,abc,323,123,acvd
4,dbc,323,123,acvd

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Upvotes: 1
