RatVader

Reputation: 23

Is there an easy way to remove duplicate fields in a CSV?

I have a CSV with dates and names. I want to blank out duplicate dates but keep all the names, using bash tools. E.g.:

Input CSV:

01/01/1960,"Torvalds, Linus"
01/01/1960,"Gates, Bill"
04/20/1960,"Woz, Steve"

Output CSV:

01/01/1960,"Torvalds, Linus"
,"Gates, Bill"
04/20/1960,"Woz, Steve"

I know I could do this using loops and line numbers, but I'm wondering if there's a better way. Perhaps using sort or uniq?
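For reference, the loop-based approach I have in mind looks roughly like this (the filename input.csv is just an example), which works but feels clunky:

```shell
# Track the previous date; blank the date field on consecutive repeats.
prev=""
while IFS= read -r line; do
  date=${line%%,*}    # everything before the first comma
  rest=${line#*,}     # everything after it (the quoted name)
  if [ "$date" = "$prev" ]; then
    printf ',%s\n' "$rest"
  else
    printf '%s,%s\n' "$date" "$rest"
    prev=$date
  fi
done < input.csv
```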

Upvotes: 2

Views: 58

Answers (1)

P....

Reputation: 18371

Using sort and awk:

$ sort -t',' -k1 input.csv | awk -F, -v OFS=, '($1 in a) {gsub($1,"")} !($1 in a) {a[$1]=$1} 1'
01/01/1960,"Gates, Baleee"
,"Gates, Benny"
,"Gates, Bill"
,"Torvalds, Linus"
04/20/1960,"Wozniac, pitor"
,"Woz, Steve"

When input.csv is as follows:

$ cat input.csv
01/01/1960,"Torvalds, Linus"
01/01/1960,"Gates, Bill"
01/01/1960,"Gates, Benny"
04/20/1960,"Woz, Steve"
01/01/1960,"Gates, Baleee"
04/20/1960,"Wozniac, pitor"

Explanation: Here, FS is set to a comma. For each line, the first column is stored in the array a if it's not already present; if it is already present, the first column is replaced with a blank using the gsub function. sort is used first to bring identical dates together.
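A slightly shorter sketch of the same idea, if you prefer: seen[$1]++ is false only on the first occurrence of a date, so blanking $1 on every later occurrence gives the same result (assigning to $1 makes awk rebuild the line with OFS). The filename input.csv is assumed, and the input must already be sorted (or at least grouped) by date:

```shell
# Blank the date field whenever that date has been seen before.
awk -F, -v OFS=, 'seen[$1]++ { $1="" } 1' input.csv
```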

Upvotes: 1

Related Questions