Reputation: 157
I have a dataset like this:
6 AA_A_56_30018678_E 0 30018678 P A
6 SNP_A_30018678 0 30018678 A G
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
6 AA_A_62_30018696_Q 0 30018696 P A
6 AA_A_62_30018696_G 0 30018696 P A
6 AA_A_62_30018696_R 0 30018696 P A
I want to remove all rows where column 4 has duplicates.
I have used the commands below (using sort, awk, uniq and join...) to get the required output; however, is there a better way to do this?
sort -k4,4 example.txt | awk '{print $4}' | uniq -u > snp_sort.txt
join -1 1 -2 4 snp_sort.txt example.txt | awk '{print $3,$5,$6,$1}' > uniq.txt
Here is the output
SNP_A_30018679 T G 30018679
SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
Upvotes: 2
Views: 4466
Reputation: 459
A simpler way to achieve this:
cat file.csv | cut -d, -f3,5,6,1 | sort -u > uniq.txt
Upvotes: 1
Reputation: 37394
Another in awk:
$ awk '{$1=$1; a[$4]=a[$4] $0} END{for(i in a) if(gsub(FS,FS,a[i])==5) print a[i]}' file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
Concatenate lines into the array using $4 as the key. If an entry ends up with more than 5 field separators, duplicates were concatenated onto it and it will not be printed.
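A closely related sketch (my own variant, not from the answer above) avoids counting separators by keeping a per-key count alongside the last line seen for that key:
$ awk '{ line[$4] = $0; cnt[$4]++ }                          # remember last line and occurrence count per key
     END { for (k in cnt) if (cnt[k] == 1) print line[k] }' file
As with the for (i in a) loop above, the output order of this variant is not guaranteed.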
And yet another version in awk. It expects the file to be sorted on the fourth field. It doesn't store all lines in memory, only the keys (this could probably be avoided too, since the key field is sorted; may be fixed later), and it runs in one pass:
$ cat ananother.awk
++seen[p[4]]==1 && NR>1 && p[4]!=$4 {   # seen count must be 1 and
    print prev                          # this and previous $4 must differ
    delete seen                         # is this enough really?
}
{
    q=p[4]                              # previous previous $4 for END
    prev=$0                             # previous is stored for printing
    split($0,p)                         # to get previous $4
}
END {                                   # last record control
    if(++seen[$4]==1 && q!=$4)
        print $0
}
Run:
$ sort -k4,4 file | awk -f ananother.awk
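Under the same assumption of input sorted on the fourth field, here is a minimal one-pass sketch of my own (not part of the answer above) that buffers only the current group and prints it when the group turns out to contain a single line:
$ sort -k4,4 file | awk '
    p4 != $4 { if (cnt == 1) print prev; cnt = 0 }   # key changed: flush previous group if it was unique
             { prev = $0; p4 = $4; cnt++ }           # remember last line, its key and the group size
    END      { if (cnt == 1) print prev }            # flush the final group
'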
Upvotes: 1
Reputation: 203169
$ awk 'NR==FNR{c[$4]++;next} c[$4]<2' file file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
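For readability, here is the same two-pass idea spelled out with comments (a sketch, equivalent to the one-liner above):
$ awk '
    NR==FNR { c[$4]++; next }   # first pass over the file: count each value of field 4
    c[$4] < 2                   # second pass: print only lines whose count is 1
' file file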
Upvotes: 2
Reputation: 18351
Using command substitution: first print only the values that are unique in the fourth field, then grep for those values in the file.
grep "$(echo "$(awk '{print $4}' inputfile.txt)" |sort |uniq -u)" inputfile.txt
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
Note: append awk '{NF=4}1' to the end of the command if you wish to print only the first four columns. Of course, you can change the number of columns by changing the values of $4 and NF=4.
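For example, with awk '{NF=4}1' appended, the whole command (a sketch, assuming the same inputfile.txt) would look like this and should print only the first four columns:
grep "$(echo "$(awk '{print $4}' inputfile.txt)" |sort |uniq -u)" inputfile.txt | awk '{NF=4}1'
6 SNP_A_30018679 0 30018679
6 SNP_A_30018682 0 30018682
6 SNP_A_30018695 0 30018695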
Upvotes: 2
Reputation: 85530
Using awk to filter out duplicates and print only those lines whose fourth field occurs exactly once.
awk '{k=($2 FS $5 FS $6 FS $4)} {a[$4]++;b[$4]=k}END{for(x in a)if(a[x]==1)print b[x]}' input_file
SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
SNP_A_30018679 T G 30018679
The idea is to count the $4 entries in array a while storing the reformatted line for each $4 in array b, then print in the END block only the entries whose count is 1.
Upvotes: 2
Reputation: 4392
Since your 'key' is fixed width, uniq has a -w option to compare on it.
sort -k4,4 example.txt | uniq -u -f 3 -w 8 > uniq.txt
Upvotes: 1