Reputation: 705
If I have FILE1 like so:
1 56903
1 293943
1 320022
2 24050
2 404999
2 1003093
and FILE2 like so:
1 rs40209 56903
1 rs209485 79382
1 rs392392 320022
2 rs30302 504922
2 rs3202309 707899
2 rs39339 1003093
and I want to pull out the second column provided that the 1st and second of FILE1 match with FILE2, I can use a nested loop to read through FILE1 this awk command:
while read COL1 COL2; do
awk -v COL1=$COL1 -v COL2=$COL2 '($1==COL1 && $3==COL2){print $2}' $FILE2
done < "$FILE1"
which would return
rs40209
rs392392
rs39339
as output.
However, using nested loops to do this is very very slow.
Most of the data are sorted, but not all, and I cannot sort because other files depend on these files being in the current order.
Using python, what would be a very fast way to get this done for FILE1 with ~2M entries and FILE2 with ~1M entries?
Upvotes: 1
Views: 331
Reputation: 134048
In Python you would read the contents of 1st file in a set
. The set
is backed by a hash table which average O(1)
(constant-time) lookup:
with open('FILE1') as file:
file1_contents = { tuple(line.split()) for line in file }
Then filter the second file:
with open('FILE2') as file2:
for line in file2:
c1, c2, c3 = line.split()
if (c1, c3) in file1_contents:
print(c2)
Which with FILE1
and FILE2
having contents as in the question, results in output
rs40209
rs392392
rs39339
With 2 million entries, the set will consume quite a bit amount of ram (closer to 1 gigabyte), but will have the fastest asymptotic time complexity. Total time for 2 million entries in FILE1 and 1 million in FILE2 was 5 seconds on my laptop. It also is ~4 times faster than Fabricator's AWK script:
% time awk -f script.awk FILE2 > /dev/null
awk -f script.awk FILE2 > /dev/null 17.47s user 0.14s system 99% cpu 17.606 total
% time python filter.py > /dev/null
python filter.py > /dev/null 4.32s user 0.20s system 99% cpu 4.526 total
Upvotes: 3
Reputation: 12782
awk has associative array, and the first file can be loaded in the beginning:
script.awk:
BEGIN {
while (( getline < "file1.txt" ) > 0) {
file1[$1 " " $2] = 1;
}
}
{
if ($1 " " $3 in file1) {
print $2;
}
}
Then run
awk -f script.awk file2.txt
Upvotes: 1