CSV remove duplicates in 2nd column without deleting first row

Question

I have a CSV with 17 columns and many thousands of rows. In column 2, I want to remove duplicate values but keep the first row in which each value appears.

File example:

1001,Henry
1002,Dave
1003,Dave
1004,Tom

When I run:

sort -t, -k2,2 -u file.csv -o newfile.csv

newfile.csv contains (wrong):

1001,Henry
1004,Tom

Desired output:

1001,Henry
1002,Dave
1004,Tom

I've tried several things with awk as well, no luck. Thanks in advance!

Paras Mishra · Accepted Answer

Try this:

awk -F ',' '!seen[$2]++' file.csv > newfile.csv

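This command tells awk which lines to print. -F ',' sets the field separator to a comma, so $2 holds the entire contents of column 2. seen is an associative array and the square brackets are array access, so seen[$2] is the entry for the current column-2 value. For each line in file.csv, the expression !seen[$2]++ first tests whether that entry is still unset (the ! negates it) and then increments it. The test is true only the first time a value appears, so awk prints that line and skips every later line with the same value in column 2.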
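
To make the one-liner easier to follow, here is a long-hand sketch of the same idea. It assumes the sample data from the question (two columns only, for brevity), a POSIX awk, and that no field contains a quoted comma, since -F ',' splits on every comma.

```shell
# Recreate the sample input from the question.
printf '%s\n' '1001,Henry' '1002,Dave' '1003,Dave' '1004,Tom' > file.csv

# Long-hand equivalent of: awk -F ',' '!seen[$2]++'
awk -F ',' '
{
    if (!($2 in seen)) {   # first time this column-2 value appears
        print              # keep the whole line
    }
    seen[$2] = 1           # remember the value for later lines
}' file.csv > newfile.csv

cat newfile.csv
# 1001,Henry
# 1002,Dave
# 1004,Tom
```

Only column 2 is used as the key and the whole line is printed, so the other columns should pass through unchanged on the full 17-column file.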
