user7079832

Reputation:

drop duplicates and keep first in a csv file in unix

Hi, I have a CSV file whose content looks like this:

NAME,AGE
abc,12
def,13
NAME,AGE  ## duplicate line: this repeats the column headers
sdd,34
krgj,656

I tried to remove the duplicates with sort:

sort -u file.csv -o file.csv

All the duplicate rows were dropped, but the output came back sorted, so the wrong occurrence was kept. I need to keep the first occurrence of each line, in the original order, so that my column header stays at the top.

Please help in this regard.

Upvotes: 0

Views: 1168

Answers (3)

glenn jackman

Reputation: 246877

The idiomatic awk program for this task is:

awk '!seen[$0]++' file

For each line ($0) in the file, we increment the count of how many times we've seen that line. Because we're using the post-increment operator, seen[$0]++ evaluates to zero the first time a line is encountered and to a non-zero value on every later occurrence. Negating it therefore yields true only for the first occurrence, and awk's default action for a true pattern is to print the line.
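
For illustration, the one-liner can be spelled out long-hand; this sketch is equivalent, using the question's file name:

# print a line only the first time it is seen, then bump its count
awk '{ if (seen[$0] == 0) print $0; seen[$0]++ }' file.csv

To rewrite the file in place, send the output to a temporary file first, e.g. awk '!seen[$0]++' file.csv > tmp && mv tmp file.csv.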

Upvotes: 1

agc

Reputation: 8406

Using datamash's non-sorting deduplication line filter rmdup (requires datamash v1.0.7 or newer):

datamash rmdup 1 < source.csv

Output:

NAME,AGE
abc,12
def,13
sdd,34
krgj,656
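
datamash splits fields on tabs by default, so field 1 here is the entire comma-separated row, which is why rmdup 1 deduplicates whole lines. If you instead wanted to deduplicate on the NAME column alone, the -t option sets the field separator; a sketch, assuming your datamash build supports -t:

# keep only the first row for each distinct NAME value
datamash -t, rmdup 1 < source.csv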

Upvotes: 0

Tim

Reputation: 2187

This isn't the most elegant solution, but it works.

head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv

It works by writing (>) the first line to output.csv, then using grep -v to remove every line matching that header and appending (>>) the result to output.csv.

Example:

root@merlin:/tmp# cat source.csv 
NAME,AGE
abc,12
def,13
NAME,AGE
sdd,34
krgj,656
root@merlin:/tmp# head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv
root@merlin:/tmp# cat output.csv 
NAME,AGE
abc,12
def,13
sdd,34
krgj,656

If you need to deduplicate the data rows as well (note that sort -u will reorder them):

head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv |sort -u >> output.csv
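
One caveat with both forms: grep treats the header as a regular expression and matches it anywhere in a line, so a data row that happened to contain the header text would be removed too. The standard -F (fixed string) and -x (whole line) flags make this safer; a sketch:

# write the header, then append every line that is not literally the header
head -n1 source.csv > output.csv
grep -Fxv "$(head -n1 source.csv)" source.csv >> output.csv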

Upvotes: 0
