Tabbi
Tabbi

Reputation: 69

Removing specific lines/rows or duplicates from a file in Linux

I want to remove specific rows from a .csv file (named 'mydata') on Linux using values under a specific variable 'SAMPLE'. I actually have 250 duplicates in my file that in original have 15000 rows and 66 columns and I want to remove one copy of each duplicate and keep the one. If you look at other variables e.g., 'ID', you will see duplicates are there. So either if I could remove duplicates or specific rows based on values in 'SAMPLE' column, any solution would do for me. SAMPLE is the only column in my data where names of duplicates are not same. My data looks like this;

 ID      SAMPLE LABNO Oth_ID sex   age   bmi ca_1 pd_7
1003341  21863 21863     NA   1 48.68 22.42    0    0
1003343  22697 22697     NA   1 48.98 23.25    0    0
1003347   4421  4421     NA   1 48.70 25.56   NA   NA
1003348   1642  1642     NA   1 48.72 16.57   NA   NA
1003349   4163  4163   6069   1 49.02 23.47    1   NA
1003349   6069  4163   6069   1 49.02 23.47    1   NA
1003356   5347  9053   5347   1 49.08 24.81    0    0
1003356   9053  9053   5347   1 49.08 24.81    0    0
1003357    695   695     NA   1 49.08 22.32   NA   NA
1003360  19833 19833     NA   1 48.55 22.48    0    0
1003365   5392  6843   5392   1 48.70 23.08    0    0
1003365   6843  6843   5392   1 48.70 23.08    0    0

Upvotes: 0

Views: 726

Answers (2)

LeadingEdger
LeadingEdger

Reputation: 714

@zinovyev's code works fine for me. @Tabbi, to resolve your issue, run the remove_dup.sh script (Code + DATA in one file) below:

#!/bin/bash

# Write the top line
sed '0,/^__DATA__$/d' "$0" | head -n 1

# Make unique, sort and append other lines
sed '0,/^__DATA__$/d' "$0" | tail -n +2 | sort -k 3,4 -u | sort -V

exit

__DATA__
 ID      SAMPLE LABNO Oth_ID sex   age   bmi ca_1 pd_7
1003341  21863 21863     NA   1 48.68 22.42    0    0
1003343  22697 22697     NA   1 48.98 23.25    0    0
1003347   4421  4421     NA   1 48.70 25.56   NA   NA
1003348   1642  1642     NA   1 48.72 16.57   NA   NA
1003349   4163  4163   6069   1 49.02 23.47    1   NA
1003349   6069  4163   6069   1 49.02 23.47    1   NA
1003356   5347  9053   5347   1 49.08 24.81    0    0
1003356   9053  9053   5347   1 49.08 24.81    0    0
1003357    695   695     NA   1 49.08 22.32   NA   NA
1003360  19833 19833     NA   1 48.55 22.48    0    0
1003365   5392  6843   5392   1 48.70 23.08    0    0
1003365   6843  6843   5392   1 48.70 23.08    0    0

and you should see the result

 ID      SAMPLE LABNO Oth_ID sex   age   bmi ca_1 pd_7
1003341  21863 21863     NA   1 48.68 22.42    0    0
1003343  22697 22697     NA   1 48.98 23.25    0    0
1003347   4421  4421     NA   1 48.70 25.56   NA   NA
1003348   1642  1642     NA   1 48.72 16.57   NA   NA
1003349   4163  4163   6069   1 49.02 23.47    1   NA
1003356   5347  9053   5347   1 49.08 24.81    0    0
1003357    695   695     NA   1 49.08 22.32   NA   NA
1003360  19833 19833     NA   1 48.55 22.48    0    0
1003365   5392  6843   5392   1 48.70 23.08    0    0

Upvotes: 2

zinovyev
zinovyev

Reputation: 2163

You can use a simple sort command. Just define the proper column(s) with the -k key and add the -u option to make the result set uniq:

sort -k 3,4 -u < mydata.csv

In this answer I suppose that you're looking for unique values based on the LABNO and Oth_ID columns which is the numbers 3 and 4.

Actually to keep the column names on top use:

# Write the top line to the new file
head -n 1 mydata.csv > mydata.uniq.csv

# Make unique, sort and append other lines
tail -n +2 mydata.csv | sort -k 3,4 -u | sort -V >> mydata.uniq.csv

Before:

 ID      SAMPLE LABNO Oth_ID sex   age   bmi ca_1 pd_7
1003341  21863 21863     NA   1 48.68 22.42    0    0
1003343  22697 22697     NA   1 48.98 23.25    0    0
1003347   4421  4421     NA   1 48.70 25.56   NA   NA
1003348   1642  1642     NA   1 48.72 16.57   NA   NA
1003349   4163  4163   6069   1 49.02 23.47    1   NA
1003349   6069  4163   6069   1 49.02 23.47    1   NA
1003356   5347  9053   5347   1 49.08 24.81    0    0
1003356   9053  9053   5347   1 49.08 24.81    0    0
1003357    695   695     NA   1 49.08 22.32   NA   NA
1003360  19833 19833     NA   1 48.55 22.48    0    0
1003365   5392  6843   5392   1 48.70 23.08    0    0
1003365   6843  6843   5392   1 48.70 23.08    0    0

After:

 ID      SAMPLE LABNO Oth_ID sex   age   bmi ca_1 pd_7
1003341  21863 21863     NA   1 48.68 22.42    0    0
1003343  22697 22697     NA   1 48.98 23.25    0    0
1003347   4421  4421     NA   1 48.70 25.56   NA   NA
1003348   1642  1642     NA   1 48.72 16.57   NA   NA
1003349   4163  4163   6069   1 49.02 23.47    1   NA
1003356   5347  9053   5347   1 49.08 24.81    0    0
1003357    695   695     NA   1 49.08 22.32   NA   NA
1003360  19833 19833     NA   1 48.55 22.48    0    0
1003365   5392  6843   5392   1 48.70 23.08    0    0

Upvotes: 2

Related Questions