Reputation:
Hi, I have a CSV file whose content looks like this:
NAME,AGE
abc,12
def,13
NAME,AGE   # duplicate line, though these are the column names
sdd,34
krgj,656
I tried a sort command to remove the duplicates:
sort -u file.csv -o file.csv
All the duplicate rows got dropped, but it kept the last occurrence of each; I need to keep the first one so that my column header stays at the top. Please help in this regard.
Upvotes: 0
Views: 1168
Reputation: 246877
The idiomatic awk program for this task is:
awk '!seen[$0]++' file
For each line ($0) in the file, we increment the number of times we've seen that line. Since we're using the post-increment operator, the expression seen[$0]++ evaluates to zero the first time a line is encountered and to a non-zero value for every later occurrence. Negating it therefore yields true only for the first occurrence, and awk's default action for a true pattern is to print the line.
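Since you overwrote file.csv in place with sort -o: awk has no portable in-place option, so a minimal sketch of the same workflow (the temporary file name here is arbitrary) is:
awk '!seen[$0]++' file.csv > file.csv.tmp && mv file.csv.tmp file.csv
With GNU awk 4.1 or newer, the inplace extension does this in one step:
gawk -i inplace '!seen[$0]++' file.csv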
Upvotes: 1
Reputation: 8406
Using datamash's non-sorting deduplication filter rmdup (requires datamash v1.0.7 or newer):
datamash rmdup 1 < source.csv
Output:
NAME,AGE
abc,12
def,13
sdd,34
krgj,656
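Note that datamash splits fields on tabs by default, so with this comma-separated input each whole line counts as field 1, which is why rmdup 1 deduplicates entire lines here. If you instead wanted to deduplicate on the NAME column alone, a sketch (assuming the same source.csv) would set the separator explicitly:
datamash -t, rmdup 1 < source.csv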
Upvotes: 0
Reputation: 2187
This isn't the most elegant solution but it works.
head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv
It works by writing (>) the first line to output.csv, then removing every line that matches that first line using grep -v, and appending (>>) the result to output.csv.
Example:
root@merlin:/tmp# cat source.csv
NAME,AGE
abc,12
def,13
NAME,AGE
sdd,34
krgj,656
root@merlin:/tmp# head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv
root@merlin:/tmp# cat output.csv
NAME,AGE
abc,12
def,13
sdd,34
krgj,656
If you need to deduplicate the data rows as well:
head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv | sort -u >> output.csv
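One caveat: grep treats the pattern as a regular expression and matches substrings, so a header containing regex metacharacters, or a data line that merely contains the header text, could be filtered incorrectly. A safer sketch of the same pipeline matches the header as a fixed string (-F) against whole lines only (-x):
head -n1 source.csv > output.csv; grep -Fxv "$(head -n1 source.csv)" source.csv >> output.csv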
Upvotes: 0