Peaceful_Warrior

Reputation: 59

Linux Bash commands to remove duplicates from a CSV file

I will be combining a number of CSV files. What I am trying to do is to:

1) Remove duplicate rows from the file; however, I need to check multiple columns as the criteria for what constitutes a duplicate. How do I do that?

2) It would be nice to also create a second output file containing the removed rows, in case something was removed that was not supposed to be removed.

3) Use a list of items in an input file as a filter: if a row contains one of those words in a particular column, remove the entire row.

If someone could help me with the commands to do this, that would be great! Please let me know if I need to clarify.

Here is a sample of what the data looks like (added as an example, as suggested). I have a CSV file like this:

column1    column2

john       kerry
adam       stephenson
ashley     hudson
john       kerry
etc..

For question 1, I want to remove duplicates from this file to get only:

column1    column2

john       kerry
adam       stephenson
ashley     hudson

For question 3, I want to take the second list (the deduplicated output from question 1) and scrub it further. I want a file like input.txt that contains:

adam

Then, the final output will be:

column1    column2

john       kerry
ashley     hudson

So, the input.txt file in the example contains the word adam (this way I can build a long list of words to check in the input.txt file). For #3, I need a code snippet that will check column 1 of every line of the CSV against all the words in the input file, then remove any matching rows from the CSV.

Upvotes: 3

Views: 10079

Answers (1)

jaypal singh

Reputation: 77105

You need to provide more details for question 3, but for questions 1 and 2 the following awk one-liner will work.

awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

And with some whitespace added for clarity:

awk 'seen[$0]++ {
  print $0 > "dups.csv"; next
}
{
  print $0 > "new.csv"
}' mycsv

This will not print anything to STDOUT but will create two files: dups.csv will contain the duplicates that were removed (that is, if there are 5 entries of the same line, this file will contain the 4 that were dropped as dups), and new.csv will contain all unique rows.

seen[$0]++ is a test we run on each line. If the line is already present in the array, it is written to the dups.csv file and we move on to the next line using next. If the line is not present, we add it to the array and write it to the new.csv file.

$0 means the entire line. If you want to test only certain columns, you can do so; you just need to set the input field separator to match your delimiter. You have mentioned CSV, but I don't see any comma delimiters, so I am using the default separator, which is [[:space:]]+.
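
To illustrate, assuming just the four data rows from the question (without the header line) are saved in a file called mycsv, a run would look roughly like this:

$ awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
$ cat new.csv
john       kerry
adam       stephenson
ashley     hudson
$ cat dups.csv
john       kerry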

Follow-up from the asker: Also, it is comma separated; I was just putting sample data up. So, if I want to use the above example but test only columns 3 & 4 (using the seen command), how would I do that in a comma-separated file?

For a true CSV, just set the field separator to ,. seen is not a command; it is an associative array that holds the chosen columns as keys. So you would modify the above command to:

awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
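
As a quick sketch (the four-column rows below are hypothetical, just to show that only columns 3 and 4 decide what counts as a duplicate):

$ cat sample.csv
a,b,john,kerry
c,d,john,kerry
e,f,ashley,hudson

$ awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' sample.csv
$ cat new.csv
a,b,john,kerry
e,f,ashley,hudson
$ cat dups.csv
c,d,john,kerry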

Update:

Once you have a list without dups using the commands stated above, we are left with:

$ cat new.csv 
john,kerry
adam,stephenson
ashley,hudson

$ cat remove.txt 
adam

$ awk -F, 'NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt new.csv 
john,kerry
ashley,hudson
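
For readers unfamiliar with the idiom: NR==FNR is true only while the first file (remove.txt) is being read, so that block builds the remove array from the word list and skips to the next line; for the second file, the pattern !($1 in remove) prints only rows whose first column is not in that array. If you wanted to do the dedup and the word filtering in a single pass, a sketch along the same lines (reusing the file names from this thread; treat it as an outline rather than a tested one-liner) could be:

awk -F, 'NR==FNR{remove[$1]++; next}   # first file: collect words to drop
seen[$0]++{print > "dups.csv"; next}   # later file: duplicates go to dups.csv
!($1 in remove)' remove.txt mycsv > final.csv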

Upvotes: 8
