mf94

Reputation: 479

Bash: excluding duplicate rows (both of each pair) based on one column

I have a file (called example.txt) that looks like the following:

A B C  
D E F  
H I C  
Z B Y  
A B C  
T E F  
W O F  

Based on column 2 only, I wish to identify all rows that have a non-unique entry and remove them completely. My real file may have duplicate entries, triplicate entries, quadruple entries, etc. I just want to keep the rows whose column-2 entry is unique.

The output file should look like this:

H I C  
W O F

I initially wanted to do this in R, but my file is so big that R is too slow and crashes. So I would like to do this directly in bash. I am new to bash; I tried this, but it is not working:

arrayTmp=($(cat example.txt | awk '{print $2}' | sort | uniq -d))  
sed "/${arrayTmp[@]}\/d" example.txt
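For what it's worth, the array-plus-`sed` idea can be made to work by looping over the duplicate values and anchoring each pattern to the second field. This is a sketch under two assumptions: the fields are separated by single spaces, and GNU `sed` is available for in-place editing (`-i`); the `printf` line just recreates the sample input from the question.

```shell
# Recreate the sample input from the question.
printf '%s\n' 'A B C' 'D E F' 'H I C' 'Z B Y' 'A B C' 'T E F' 'W O F' > example.txt

# Work on a copy so the original file is left untouched.
cp example.txt filtered.txt

# Collect the column-2 values that occur more than once.
arrayTmp=($(awk '{print $2}' example.txt | sort | uniq -d))

for val in "${arrayTmp[@]}"; do
    # Delete lines whose second (space-separated) field is exactly $val.
    sed -i "/^[^ ]* $val /d" filtered.txt
done

cat filtered.txt
```

The key fix over the original attempt is that `sed` gets one anchored pattern per duplicate value, instead of the whole array expanded into a single malformed expression.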

Upvotes: 0

Views: 74

Answers (2)

Unwastable

Reputation: 679

Assuming these characters appear only in the 2nd column, this is achievable by selecting the non-matching lines in example.txt; no array is required.

tmp=$(awk '{print $2}' example.txt | sort | uniq -d)
grep -v -f <(echo "$tmp") example.txt

output:

H I C
W O F
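One caveat with `grep -f` is that each pattern matches anywhere in the line, not just in column 2, so a duplicate value that also appears in another column would remove extra rows. A sketch of a stricter variant that anchors the test to the second field, using two passes over the file with `awk` (order preserved; the `printf` line just recreates the sample input):

```shell
# Recreate the sample input from the question.
printf '%s\n' 'A B C' 'D E F' 'H I C' 'Z B Y' 'A B C' 'T E F' 'W O F' > example.txt

# First pass (NR==FNR) counts each column-2 value; the second pass
# prints only the rows whose value occurred exactly once.
awk 'NR==FNR {count[$2]++; next} count[$2] == 1' example.txt example.txt
```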

Upvotes: 1

Juan Diego Godoy Robles

Reputation: 14975

If the order does not matter:

awk '{a[$2]=$0; b[$2]++} END{for (i in b) if (b[i]==1) print a[i]}' your_file
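For reference, this is how it behaves on the sample input. Because `for (i in b)` iterates awk's array in arbitrary order, the output is piped through `sort` here just to make it deterministic; the `printf` line recreates the question's data as `your_file`.

```shell
# Recreate the sample input from the question.
printf '%s\n' 'A B C' 'D E F' 'H I C' 'Z B Y' 'A B C' 'T E F' 'W O F' > your_file

# a[$2] remembers the last full line seen for each column-2 value,
# b[$2] counts occurrences; only values seen once are printed at END.
awk '{a[$2]=$0; b[$2]++} END{for (i in b) if (b[i]==1) print a[i]}' your_file | sort
```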

Upvotes: 1

Related Questions