Reputation: 69
I want to remove specific rows from a .csv file (named 'mydata') on Linux, using the values under a specific variable 'SAMPLE'. My file originally has 15000 rows and 66 columns, and it contains 250 duplicated records; I want to remove one copy of each duplicate and keep the other. If you look at other variables, e.g. 'ID', you will see where the duplicates are. SAMPLE is the only column in my data where the names of the duplicates are not the same, so either solution would do for me: removing the duplicates directly, or removing specific rows based on values in the 'SAMPLE' column. My data looks like this:
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003349 6069 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003356 9053 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
1003365 6843 6843 5392 1 48.70 23.08 0 0
Upvotes: 0
Views: 726
Reputation: 714
@zinovyev's code works fine for me. @Tabbi, to resolve your issue, run the remove_dup.sh script below (code and data in one file):
#!/bin/bash
# Write the top line
sed '0,/^__DATA__$/d' "$0" | head -n 1
# Make unique, sort and append other lines
sed '0,/^__DATA__$/d' "$0" | tail -n +2 | sort -k 3,4 -u | sort -V
exit
__DATA__
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003349 6069 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003356 9053 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
1003365 6843 6843 5392 1 48.70 23.08 0 0
and you should see the result:
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
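An alternative sketch (not part of the script above; filenames are illustrative): since the duplicate rows share the same value in the ID column, awk's classic seen[] idiom keeps the header plus the first row per ID:

```shell
# Build a small sample file (trimmed from the question's data).
cat > mydata.csv <<'EOF'
ID SAMPLE LABNO Oth_ID
1003349 4163 4163 6069
1003349 6069 4163 6069
1003356 5347 9053 5347
EOF

# NR==1 always passes the header through; !seen[$1]++ is true only the
# first time a given ID is encountered, so later duplicates are dropped.
awk 'NR==1 || !seen[$1]++' mydata.csv > mydata.uniq.csv
cat mydata.uniq.csv
```

Unlike the sort-based approach, this keeps the original row order without a second sort.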
Upvotes: 2
Reputation: 2163
You can use a simple sort command. Just define the proper column(s) with the -k key and add the -u option to make the result set unique:
sort -k 3,4 -u < mydata.csv
In this answer I assume that you're looking for unique values based on the LABNO and Oth_ID columns, which are fields 3 and 4.
To keep the column names (the header line) on top, use:
# Write the top line to the new file
head -n 1 mydata.csv > mydata.uniq.csv
# Make unique, sort and append other lines
tail -n +2 mydata.csv | sort -k 3,4 -u | sort -V >> mydata.uniq.csv
Before:
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003349 6069 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003356 9053 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
1003365 6843 6843 5392 1 48.70 23.08 0 0
After:
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
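A minimal self-contained run of the same pipeline (data trimmed to three rows; filenames as above) can be used to check the behaviour. With GNU sort, -u outputs only the first line of each run of lines with equal keys, and the final sort -V restores numeric order on the ID column:

```shell
# Trimmed sample: two rows share the LABNO/Oth_ID key "6843 5392".
cat > mydata.csv <<'EOF'
ID SAMPLE LABNO Oth_ID
1003365 5392 6843 5392
1003365 6843 6843 5392
1003349 4163 4163 6069
EOF

# Header first, then deduplicate on fields 3-4 and restore ID order.
head -n 1 mydata.csv > mydata.uniq.csv
tail -n +2 mydata.csv | sort -k 3,4 -u | sort -V >> mydata.uniq.csv
cat mydata.uniq.csv
```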
Upvotes: 2