mf94

Reputation: 479

Bash: excluding duplicate rows (both of each pair) based on one column

I have a file (called example.txt) that looks like the following:

A B C  
D E F  
H I C  
Z B Y  
A B C  
T E F  
W O F  

Based on column 2 only, I wish to identify all rows that have a non-unique entry and remove them completely. My real file may have duplicate entries, triplicate entries, quadruple entries, etc. I just want to keep the rows whose column-2 entry is unique.

The output file should look like this:

H I C  
W O F

I initially wanted to do this in R, but my file is so big that R is too slow and crashes. So I would like to do this directly in bash. I am new to bash; I tried this, but it is not working:

arrayTmp=($(cat example.txt | awk '{print $2}' | sort | uniq -d))  
sed "/${arrayTmp[@]}\/d" example.txt
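For what it's worth, the array-plus-`sed` idea can be made to work by looping over the duplicate values and anchoring each pattern to the second field. This is a sketch under two assumptions: the fields are separated by single spaces, and GNU `sed` is available for in-place editing (`-i`); the `printf` line just recreates the sample input from the question.

```shell
# Recreate the sample input from the question.
printf '%s\n' 'A B C' 'D E F' 'H I C' 'Z B Y' 'A B C' 'T E F' 'W O F' > example.txt

# Work on a copy so the original file is left untouched.
cp example.txt filtered.txt

# Collect the column-2 values that occur more than once.
arrayTmp=($(awk '{print $2}' example.txt | sort | uniq -d))

for val in "${arrayTmp[@]}"; do
    # Delete lines whose second (space-separated) field is exactly $val.
    sed -i "/^[^ ]* $val /d" filtered.txt
done

cat filtered.txt
```

The key fix over the original attempt is that `sed` gets one anchored pattern per duplicate value, instead of the whole array expanded into a single malformed expression.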

Upvotes: 0

Views: 74

Answers (2)

Unwastable

Reputation: 679

Assuming these characters appear only in the 2nd column, this is achievable by selecting the non-matching lines in example.txt; no array is required.

tmp=$(awk '{print $2}' example.txt | sort | uniq -d)
grep -v -f <(echo "$tmp") example.txt

output:

H I C
W O F
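One caveat with `grep -f` is that each pattern matches anywhere in the line, not just in column 2, so a duplicate value that also appears in another column would remove extra rows. A sketch of a stricter variant that anchors the test to the second field, using two passes over the file with `awk` (order preserved; the `printf` line just recreates the sample input):

```shell
# Recreate the sample input from the question.
printf '%s\n' 'A B C' 'D E F' 'H I C' 'Z B Y' 'A B C' 'T E F' 'W O F' > example.txt

# First pass (NR==FNR) counts each column-2 value; the second pass
# prints only the rows whose value occurred exactly once.
awk 'NR==FNR {count[$2]++; next} count[$2] == 1' example.txt example.txt
```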

Upvotes: 1

Juan Diego Godoy Robles

Reputation: 14975

If the order does not matter:

awk '{a[$2]=$0; b[$2]++} END{for (i in b) if (b[i]==1) print a[i]}' your_file
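For reference, this is how it behaves on the sample input. Because `for (i in b)` iterates awk's array in arbitrary order, the output is piped through `sort` here just to make it deterministic; the `printf` line recreates the question's data as `your_file`.

```shell
# Recreate the sample input from the question.
printf '%s\n' 'A B C' 'D E F' 'H I C' 'Z B Y' 'A B C' 'T E F' 'W O F' > your_file

# a[$2] remembers the last full line seen for each column-2 value,
# b[$2] counts occurrences; only values seen once are printed at END.
awk '{a[$2]=$0; b[$2]++} END{for (i in b) if (b[i]==1) print a[i]}' your_file | sort
```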

Upvotes: 1

Related Questions