Reputation: 53
I want to merge duplicate rows in a CSV (rows that share the same ID) and concatenate the values of a specific column (in this case, column2).
Input
ID column2 column3 column4, etc....
1 a test3 test4
1 r test3 test4
1 c test3 test4
2 r test3 test4
2 o test3 test4
3 a test3 test4
4 b test3 test4
4 c test3 test4
4 e test3 test4
Expected result
ID column2 column3 column4, etc....
1 a|r|c test3 test4
2 r|o test3 test4
3 a test3 test4
4 b|c|e test3 test4
Is it possible with awk?
Upvotes: 2
Views: 590
Reputation: 58371
This might work for you (GNU sed & column):
sed -r '1b;:a;$!N;s/^(\s*\S+\s)(\S+)\s*(\S+\s*\S+\s*)(.*)\n\1(\S+)\s*\3/\1\2|\5 \3\4/;ta;P;D' file | column -t
Pattern match on all lines except the first and then format the expected result using back references and the column command.
N.B. The first field is stripped of its white space.
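To see the sed stage in action, it can be run on its own against the question's sample data (a sketch; `file` is the assumed input name, and GNU sed is required for `-r` and `\s`):

```shell
# Recreate the sample input from the question ('file' is an assumed name)
cat > file <<'EOF'
ID column2 column3 column4
1 a test3 test4
1 r test3 test4
1 c test3 test4
2 r test3 test4
2 o test3 test4
3 a test3 test4
4 b test3 test4
4 c test3 test4
4 e test3 test4
EOF

# Merge adjacent rows that share the ID and columns 3-4, joining column 2 with "|"
sed -r '1b;:a;$!N;s/^(\s*\S+\s)(\S+)\s*(\S+\s*\S+\s*)(.*)\n\1(\S+)\s*\3/\1\2|\5 \3\4/;ta;P;D' file
# Expected output:
#   ID column2 column3 column4
#   1 a|r|c test3 test4
#   2 r|o test3 test4
#   3 a test3 test4
#   4 b|c|e test3 test4
```

Piping the result through `column -t`, as in the answer, then aligns the fields for display.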
Upvotes: 0
Reputation: 5965
With awk, for a variable column, covering the general case where any of the other columns may change:
awk -v col=2 -v OFS="\t" '{
    temp = $col
    $col = ""                          # key on the line with $col removed
    a[$0] = ($0 in a) ? a[$0] "|" temp : temp
}
END {
    for (i in a) {
        n = split(i, b)                # recover the remaining fields
        for (j = 1; j <= n; j++) {
            if (j == col) printf "%s%s", a[i], OFS
            printf "%s%s", b[j], OFS
        }
        printf "%s", ORS
    }
}' file | sort -n | column -t
This uses an associative array indexed by the line with $col removed, and appends each value of $col to the corresponding entry. At the END, $col is put back in its place while printing, by splitting each index back into fields in another array. The order of the output is undefined, so the result is piped to sort, which can sort in any mode on any field, and to column -t if you need aligned output.
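Run against the question's sample data, the approach produces the following (a sketch, with `file` as the assumed input name; `column -t` is left off so the raw tab-separated result is visible, and `sort -n` puts the header first because "ID" compares as 0):

```shell
# Recreate the sample input from the question ('file' is an assumed name)
cat > file <<'EOF'
ID column2 column3 column4
1 a test3 test4
1 r test3 test4
1 c test3 test4
2 r test3 test4
2 o test3 test4
3 a test3 test4
4 b test3 test4
4 c test3 test4
4 e test3 test4
EOF

# Group on the line minus column 2, then splice the joined values back in
awk -v col=2 -v OFS="\t" '{
    temp = $col; $col = ""
    a[$0] = ($0 in a) ? a[$0] "|" temp : temp
}
END {
    for (i in a) {
        n = split(i, b)
        for (j = 1; j <= n; j++) {
            if (j == col) printf "%s%s", a[i], OFS
            printf "%s%s", b[j], OFS
        }
        printf "%s", ORS
    }
}' file | sort -n
# Output (tab-separated; each line carries a trailing OFS):
#   ID   column2  column3  column4
#   1    a|r|c    test3    test4
#   2    r|o      test3    test4
#   3    a        test3    test4
#   4    b|c|e    test3    test4
```

Note that `n = split(i, b)` is used instead of `length(b)`, since `length()` on an array is a gawk extension, and the values are printed with an explicit `"%s"` format so data containing `%` cannot be misread as a printf format string.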
Upvotes: 1