Reputation: 59
I will be combining a number of CSV files. What I am trying to do is to:
1) Remove duplicate rows from the file; however, I need to check multiple columns as the criteria for what constitutes a duplicate. How do I do that?
2) It would be nice to then create a 2nd output file showing what was removed, in case something was removed that should not have been.
3) Create a list of items as an input file to use as a filter: if a row contains a given word in a particular column, then remove the entire row.
If someone could help me with the commands to do this, that would be great! Please let me know if I need to clarify.
Here is a sample of what the data looks like (an example, as suggested). I have a csv file like this:
column1 column2
john kerry
adam stephenson
ashley hudson
john kerry
etc..
For question 1, I want to remove duplicates from this file to get only:
column1 column2
john kerry
adam stephenson
ashley hudson
For question 3, I want to take the 2nd list (meaning the output from question 1) and scrub it further. I want a file like input.txt that contains:
adam
Then, the final output will be:
column1 column2
john kerry
ashley hudson
So, the input.txt file in the example contains the word adam (this way I can make a long list of words to check in the input.txt file). For #3, I need a code snippet that will check column 1 of all lines of the CSV against all the words in the input file, then remove any matching rows from the csv.
Upvotes: 3
Views: 10079
Reputation: 77105
You need to provide more details for question 3, but for questions 1 and 2 the following awk one-liner will work:
awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
And with some whitespace added for clarity:
awk 'seen[$0]++ {               # true if this exact line has been seen before
    print $0 > "dups.csv"      # duplicate: write it to dups.csv
    next                       # and skip to the next input line
}
{
    print $0 > "new.csv"       # first occurrence: write it to new.csv
}' mycsv
This will not print anything to STDOUT but will create two files. dups.csv will contain all the removed duplicates (that is, if there are 5 entries of the same line, this file will contain the 4 entries that were removed as dups), and new.csv will contain all unique rows.
seen[$0]++ is a test we do for each line. If the line is already present in our array, it is written to the dups.csv file and we move on to the next line using next. If the line is not present, we add it to the array and write it to new.csv.
Use of $0 means the entire line. If you want to test fewer columns, you can do so; you just need to set the input field separator based on your delimiter. You have mentioned csv but I don't see any comma delimiters, so I am using the default separator, which is [[:space:]]+.
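For example, here is a sketch that dedupes on column 1 only (whitespace-separated input, as with the default above; the column choice is just for illustration):
awk 'seen[$1]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv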
Also, it is comma separated; I was just putting up sample data. So, if I want to use the above example but test only columns 3 & 4 (using the seen command), how would I do that in a comma-separated file?
For true csv just set the field separator to ,. seen is not a command; it is an associative array (a hash) that uses the column values as keys. So you will modify the above command to:
awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
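As a quick illustration with some made-up four-column rows (this data is hypothetical, only to show which file each row lands in):
$ cat mycsv
1,a,john,kerry
2,b,adam,stephenson
3,c,john,kerry
$ awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
$ cat new.csv
1,a,john,kerry
2,b,adam,stephenson
$ cat dups.csv
3,c,john,kerry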
Update:
Once you have a list without dups using the commands stated above, we are left with:
$ cat new.csv
john,kerry
adam,stephenson
ashley,hudson
$ cat remove.txt
adam
$ awk -F, 'NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt new.csv
john,kerry
ashley,hudson
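For clarity, here is the same filter spread out with comments (a sketch of the NR==FNR idiom: the test is true only while awk is reading the first file):
awk -F, '
NR==FNR {           # true only while reading the first file, remove.txt
    remove[$1]++    # remember each word that should be removed
    next            # skip the filter below for remove.txt lines
}
!($1 in remove)     # default action: print rows whose column 1 is not listed
' remove.txt new.csv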
Upvotes: 8