Reputation: 479
I have a file (called example.txt) that looks like the following:
A B C
D E F
H I C
Z B Y
A B C
T E F
W O F
Based on column 2 only, I wish to identify all rows that have a non-unique entry and remove them completely. My real file may have duplicate entries, triplicate entries, quadruple entries, etc. I just want to keep the rows for which the entry in column 2 is unique.
The output file should look like this:
H I C
W O F
I initially wanted to do this in R, but my file is so big that R is too slow and crashes. So I would like to do this directly in bash. I am new to bash; I tried this, but it is not working:
arrayTmp=($(cat example.txt | awk '{print $2}' | sort | uniq -d))
sed "/${arrayTmp[@]}\/d" example.txt
Upvotes: 0
Views: 74
Reputation: 679
Assuming these characters are only present in the 2nd column, this is achievable by selecting the non-matching lines in example.txt; no array is required.
tmp=$(cat example.txt | awk '{print $2}' | sort | uniq -d)
grep -v -f <(echo -e "$tmp") example.txt
output:
H I C
W O F
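The same idea also fits on one line, without the temporary variable or the extra cat, still relying on the assumption that the duplicated values appear only in column 2:
grep -v -f <(awk '{print $2}' example.txt | sort | uniq -d) example.txt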
Upvotes: 1
Reputation: 14975
If the order does not matter:
awk '{a[$2]=$0;b[$2]++}END{for (i in b){if(b[i]==1){print a[i]}}}' your_file
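If the order does matter, a two-pass variant of the same counting idea should work: the first pass counts the column-2 values, the second pass prints only the lines whose value occurred exactly once, preserving the original line order:
awk 'NR==FNR {cnt[$2]++; next} cnt[$2]==1' your_file your_file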
Upvotes: 1