Reputation: 53

How to remove duplicate rows and concatenate values from one column in CSV

I want to remove duplicate rows from a CSV and concatenate the values of specific column (in this case, column2).

Input

ID column2 column3 column4, etc....
1  a       test3   test4
1  r       test3   test4
1  c       test3   test4
2  r       test3   test4
2  o       test3   test4
3  a       test3   test4
4  b       test3   test4
4  c       test3   test4
4  e       test3   test4

Expected result

ID column2 column3 column4, etc....
1  a|r|c   test3   test4
2  r|o     test3   test4
3  a       test3   test4
4  b|c|e   test3   test4

Is it possible with awk?

Upvotes: 2

Answers (2)

potong

Reputation: 58558

This might work for you (GNU sed & column):

sed -r '1b;:a;$!N;s/^(\s*\S+\s)(\S+)\s*(\S+\s*\S+\s*)(.*)n\1(\S+)\s*\3/\1\2|\5 \3\4/;ta;P;D' file | column -t

Pattern match on all lines except the first and then format the expected result using back references and the column command.

N.B. The first field is stripped of its white space.

Upvotes: 0

thanasisp

Reputation: 5975

With awk, for variable column, for the general case where all the other columns may change.

awk -v col=2 -v OFS="\t" '{
    temp=$col
    $col=""
    a[$0]=a[$0]? a[$0] "|" temp: temp
}
END {for (i in a) {
        split(i, b)
        for (j=1; j<=length(b); j++) {
            if (j==col) printf a[i] OFS
            printf b[j] OFS
        }
        printf ORS
    }
}' file |sort -n |column -t

This uses an associative array with the line excluding $col as index and append to it the value of $col.

At the END we take care to put $col back to its place while printing, by splitting fields to another array.

The order of the output is undetermined and you can pipe it to sort for any sorting type per any field. And to column -t if you need so.

Upvotes: 1

How to remove duplicate rows and concatenate values from one column in CSV

Answers (2)

Related Questions