moore1emu
moore1emu

Reputation: 496

How to remove duplicate comma separate strings using awk

I have a csv file like this: (named test2.csv)

lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,

I want to remove the duplicate entries

The closest I have got is the following awk command

awk '{a[$0]++} END {for (i in a) print RS i}' RS="," test2.csv

it works but causes new problems, it take the values out of order and puts them in rows like this:

,Elementary Algebra 38
,2015-05-07 15:30:43
,Sentence Skills 104
,FirstName
,LastName
,1997-05-20
,83494989

I need to keep the order they are in and keep them in one line ( I can fix the row issue, but don't know how to fix the order issue)

Update with Solution:

The answer from anubhava worked great, I added a question about removing the time from the date and Ed Morton helped out with that, here is the full query

awk 'BEGIN{RS=ORS=","} {sub(/ ..:..:..$/,"")} !seen[$0]++' test2.csv

Upvotes: 3

Views: 2543

Answers (3)

Daniel J.
Daniel J.

Reputation: 366

Another way to achieve the same

Change commas to new lines

sed -e 's/,/\n/g'

Remove duplicate lines with sort -u

sed -e 's/,/\n/g' | sort -u

And finally replace new lines back to commas

sed -e 's/,/\n/g' | sort -u | sed -e 's/\n/,/g'

Upvotes: 0

user3407335
user3407335

Reputation: 136

This worked for me:

test='lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,'

echo $test |tr ',' '\n' | sort -u | grep . | tr '\n' ','

Explanation:

  1. For readability, set the variable "test" to the comma delimited string.
  2. Change the commas to newlines using tr command.
  3. Remove duplicates using 'sort -u.'
  4. Remove extra blank lines using 'grep.'
  5. Change the newlines back to commas using the tr command.

This is probably not as elegant or as fast as other approaches, but it's easier for me to understand.

Upvotes: 0

anubhava
anubhava

Reputation: 786091

You can just use this awk:

awk 'BEGIN{RS=ORS=","} !seen[$0]++' test2.csv
lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Elementary Algebra 38,

Upvotes: 9

Related Questions