Reputation: 496
I have a csv file like this: (named test2.csv)
lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,
I want to remove the duplicate entries
The closest I have got is the following awk command
awk '{a[$0]++} END {for (i in a) print RS i}' RS="," test2.csv
It works, but it causes new problems: it takes the values out of order and puts them on separate rows like this:
,Elementary Algebra 38
,2015-05-07 15:30:43
,Sentence Skills 104
,firstname
,lastname
,1997-05-20
,83494989
I need to keep the fields in their original order and on one line. (I can fix the row issue, but I don't know how to fix the order issue.)
Update with Solution:
The answer from anubhava worked great. I asked a follow-up question about removing the time from the date, and Ed Morton helped out with that; here is the full command:
awk 'BEGIN{RS=ORS=","} {sub(/ ..:..:..$/,"")} !seen[$0]++' test2.csv
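As a sanity check, the combined command can be run on the sample line from the question directly, feeding it via printf instead of the test2.csv file (using printf without a trailing newline avoids awk seeing a stray final record):

```shell
# Sample line from the question, fed via stdin instead of test2.csv
line='lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,'

# RS=ORS="," makes each comma-separated field its own record;
# sub() strips the " HH:MM:SS" suffix; !seen[$0]++ keeps the first
# occurrence of each record, in order.
printf '%s' "$line" |
  awk 'BEGIN{RS=ORS=","} {sub(/ ..:..:..$/,"")} !seen[$0]++'
# -> lastname,firstname,83494989,1997-05-20,2015-05-07,Sentence Skills 104,Elementary Algebra 38,
```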
Upvotes: 3
Views: 2543
Reputation: 366
Another way to achieve the same result (note: sort -u sorts the fields, so the original order from the question is not preserved):
Change commas to new lines
sed -e 's/,/\n/g'
Remove duplicate lines with sort -u
sed -e 's/,/\n/g' | sort -u
And finally join the lines back into one comma-separated line. A plain sed 's/\n/,/g' won't do this, because sed processes one line at a time and never sees the newlines between lines; paste does the join instead:
sed -e 's/,/\n/g' | sort -u | paste -sd, -
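A runnable sketch of the whole pipeline, fed via printf rather than a file. Two assumptions beyond the answer itself: grep . is added to drop the empty line produced by the trailing comma, and paste -sd, - performs the final join (sed works line by line and cannot substitute the separating newlines). LC_ALL=C pins down the sort order:

```shell
line='lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,'

# Split on commas, dedupe (sorted, so order changes!), drop the blank
# line left by the trailing comma, then rejoin with commas.
printf '%s' "$line" | sed -e 's/,/\n/g' | LC_ALL=C sort -u | grep . | paste -sd, -
# -> 1997-05-20,2015-05-07 15:30:43,83494989,Elementary Algebra 38,Sentence Skills 104,firstname,lastname
```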
Upvotes: 0
Reputation: 136
This worked for me:
test='lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,'
echo "$test" | tr ',' '\n' | sort -u | grep . | tr '\n' ','
Explanation:
tr ',' '\n' turns each comma into a newline so every field sits on its own line; sort -u removes the duplicate lines (losing the original order); grep . drops the empty line left by the trailing comma; and the final tr '\n' ',' joins the lines back together with commas.
This is probably not as elegant or as fast as other approaches, but it's easier for me to understand.
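If the original field order matters (which was the point of the question), one possible variant of this pipeline, not part of the original answer, swaps sort -u for awk's !seen[$0]++, which keeps the first occurrence of each field in place:

```shell
line='lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,'

# awk '!seen[$0]++' prints a line only the first time it appears,
# preserving input order; paste rejoins the lines with commas.
printf '%s' "$line" | tr ',' '\n' | grep . | awk '!seen[$0]++' | paste -sd, -
# -> lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Elementary Algebra 38
```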
Upvotes: 0
Reputation: 786091
You can just use this awk:
awk 'BEGIN{RS=ORS=","} !seen[$0]++' test2.csv
lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Elementary Algebra 38,
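The trailing comma in that output comes from ORS="," being printed after every record. A quick way to check the behavior without creating test2.csv (feeding the line via printf, without a trailing newline, so awk doesn't see an extra record for the final newline of the file):

```shell
line='lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,'

# Each comma-separated field is a record; !seen[$0]++ keeps only the
# first occurrence of each, in the original order.
printf '%s' "$line" | awk 'BEGIN{RS=ORS=","} !seen[$0]++'
# -> lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Elementary Algebra 38,
```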
Upvotes: 9