Reputation: 57
I would be grateful for your help with the following.
I have the following file (file.txt), which is about 10,000 lines long:
ID1 ID2 0 1 0.5 0.6
ID3 ID4 0 0 0.4 0.8
ID1 ID5 0 1 0.5 0.3
ID6 ID2 1 0 0.4 0.8
The IDs in the first two columns can each occur between 1 and 10 times in the file (in either column 1 or column 2).
What I want to achieve:
I want to scan this file line by line, and print IDs to an ever-growing exclusion list if they meet the following criteria:
My criteria are as follows:
If $3 > $4, print $2 (ID2) to exclusionlist.txt
If $3 < $4, print $1 (ID1) to exclusionlist.txt
If $3 = $4 and $5 < $6, print $2 (ID2) to exclusionlist.txt
If $3 = $4 and $5 > $6, print $1 (ID1) to exclusionlist.txt
So, applying this to row 1, ID1 should be in my exclusion list, given that $3 < $4.
I then want to delete all lines in the file where that ID from the exclusion list appears. (This can be up to 10 rows).
The output for file.txt once row 1 has been scanned should look like:
ID3 ID4 0 0 0.4 0.8
ID6 ID2 1 0 0.4 0.8
And exclusionlist.txt: ID1
I then want to start again at the new row 1 (because the original row 1 will have been deleted by definition) and execute the same process, adding the exclusion from the new row 1 to the same exclusion list.
This is what I have tried. It has meant having to rename file.txt to 1.txt.
#!/bin/bash
for i in {1..5000}
do
awk 'NR==1{print;}' $i.txt
awk '{if ($3>$4 || $3==$4 && $5<$6) print $2;}' $i.txt > exclusionlist_$i.txt
awk '{if ($3>$4 || $3==$4 && $5>$6) print $1;}' $i.txt >> exclusionlist_$i.txt
grep -v -f exclusionlist_$i.txt $i.txt > $((i+1)).txt
rm $i.txt
done
Due to my poor scripting skills, I (1) have to rename my file after each loop so that the next iteration can run, and (2) end up with a new exclusion list per loop rather than a single 'master' exclusion list. I can easily concatenate them all at the end, so this is not a major problem, but it is messy.
The problem I have is that this command seems to scan through the whole file (rather than just line 1), creating a long exclusion list just from the first run.
Any help/suggestions would be greatly appreciated.
Thank you.
GB
Upvotes: 0
Views: 89
Reputation: 67467
I didn't understand why you need to do this in multiple steps. Eventually, all the lines will be deleted and you'll only get the exclusion list.
For example, this will do the same in one pass:
$ awk '!($1 in exc) && !($2 in exc){f=($3>$4 || $3==$4 && $5<$6)?2:1;
print $f > "exclusion.list"; exc[$f]}' file
$ cat exclusion.list
ID1
ID4
ID2
Since the only outcome is the exclusion list, you can print it to stdout
$ awk '!($1 in exc) && !($2 in exc){f=($3>$4 || $3==$4 && $5<$6)?2:1;
print $f; exc[$f]}' file > exclusion.list
and redirect to a file.
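If you want to convince yourself that the one-pass version matches the row-by-row procedure described in the question, here is a minimal sketch of the literal loop. Two things differ from the original attempt: the decision is restricted to row 1 with `NR==1` (without that guard, awk applies the test to every row, which is why the first run produced a long list), and rows are removed by exact comparison on columns 1 and 2 rather than with `grep -v -f`, which matches substrings (ID1 would also hit ID10). The file names `work.txt` and `work.tmp` are placeholders of mine, not part of the question.

```shell
#!/bin/bash
# Sample data from the question.
cat > file.txt <<'EOF'
ID1 ID2 0 1 0.5 0.6
ID3 ID4 0 0 0.4 0.8
ID1 ID5 0 1 0.5 0.3
ID6 ID2 1 0 0.4 0.8
EOF

cp file.txt work.txt      # work on a copy (work.txt is a hypothetical name)
: > exclusionlist.txt     # start with an empty master list

# Repeat until no rows are left in the working copy.
while [ -s work.txt ]; do
    # Decide from row 1 only, then stop reading the file.
    id=$(awk 'NR==1 { f = ($3>$4 || ($3==$4 && $5<$6)) ? 2 : 1; print $f; exit }' work.txt)
    echo "$id" >> exclusionlist.txt
    # Keep only rows where neither column 1 nor column 2 is exactly that ID.
    awk -v id="$id" '$1!=id && $2!=id' work.txt > work.tmp
    mv work.tmp work.txt
done

cat exclusionlist.txt
```

On the four sample rows this gives the same three IDs, in the same order, as the one-liner. It rereads the file once per excluded ID, so the single-pass awk is the better choice for real data.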
Or perhaps I misunderstood the problem. Note also that the case $3==$4 && $5==$6 is not defined in your spec. Perhaps that's what you're after? If so, create sample data that includes this critical case and indicate what needs to happen.
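As written, the one-liner sends a full tie ($3==$4 and $5==$6) to the else branch, so it silently excludes $1. Until the spec says what should happen, one option is to report such rows instead of guessing. The sketch below does that; the ID7/ID8 row is made up purely to trigger the case, `ties.txt` and `ties.log` are hypothetical file names, and skipping the row without excluding either ID is my assumption, not something the question asks for.

```shell
#!/bin/bash
# Two rows from the question plus one made-up tie row (ID7 ID8) for illustration.
cat > ties.txt <<'EOF'
ID1 ID2 0 1 0.5 0.6
ID3 ID4 0 0 0.4 0.8
ID7 ID8 1 1 0.5 0.5
EOF

awk '
    ($1 in exc) || ($2 in exc) { next }   # row already removed by an earlier exclusion
    $3==$4 && $5==$6 {                     # full tie: undefined in the spec
        print "tie on line " NR ": " $0 > "ties.log"
        next                               # assumption: exclude neither ID
    }
    {
        f = ($3>$4 || ($3==$4 && $5<$6)) ? 2 : 1
        print $f                           # the excluded ID goes to stdout
        exc[$f]                            # remember it so later rows are skipped
    }
' ties.txt > exclusion.list

cat exclusion.list
cat ties.log
```

Checking `ties.log` on the real file tells you quickly whether the undefined case occurs at all.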
Upvotes: 1