Vincent Laufer

Reputation: 705

Speed up the process of matching two columns using Python

If I have FILE1 like so:

1    56903
1    293943
1    320022
2    24050
2    404999
2    1003093

and FILE2 like so:

1    rs40209    56903
1    rs209485   79382
1    rs392392   320022
2    rs30302    504922
2    rs3202309  707899
2    rs39339    1003093

and I want to pull out the second column of FILE2 whenever the first and second columns of FILE1 match the first and third columns of FILE2, I can loop over FILE1 and run this awk command for each line:

while read COL1 COL2; do
    awk -v COL1="$COL1" -v COL2="$COL2" '($1==COL1 && $3==COL2){print $2}' "$FILE2"
done < "$FILE1"

which would return

rs40209
rs392392
rs39339

as output.

However, this nested-loop approach is very slow, since it rescans FILE2 once for every line of FILE1.

Most of the data are sorted, but not all, and I cannot sort because other files depend on these files being in the current order.

Using Python, what would be a very fast way to get this done for FILE1 with ~2M entries and FILE2 with ~1M entries?

Upvotes: 1

Views: 331

Answers (2)

In Python you would read the contents of the first file into a set. A set is backed by a hash table, which gives average O(1) (constant-time) lookups:

with open('FILE1') as file:
    # build a set of (col1, col2) tuples from FILE1
    file1_contents = { tuple(line.split()) for line in file }

Then filter the second file:

with open('FILE2') as file2:
    for line in file2:
        c1, c2, c3 = line.split()
        # constant-time membership test against the FILE1 pairs
        if (c1, c3) in file1_contents:
            print(c2)

With FILE1 and FILE2 containing the contents shown in the question, this results in the output

rs40209
rs392392
rs39339

With 2 million entries, the set will consume quite a bit of RAM (close to 1 gigabyte), but it gives the best asymptotic time complexity. The total time for 2 million entries in FILE1 and 1 million in FILE2 was about 5 seconds on my laptop. It is also ~4 times faster than Fabricator's awk script:

% time awk -f script.awk FILE2 > /dev/null
awk -f script.awk FILE2 > /dev/null  17.47s user 0.14s system 99% cpu 17.606 total
% time python filter.py > /dev/null     
python filter.py > /dev/null  4.32s user 0.20s system 99% cpu 4.526 total
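
For reference, here is a minimal sketch of what the filter.py used in the timing might look like, simply combining the two snippets from this answer into one script (the file names FILE1 and FILE2 are taken from the question; the actual benchmark script may differ):

# filter.py -- load FILE1 pairs into a set, then stream FILE2 against it
with open('FILE1') as file1:
    # set of (col1, col2) tuples from FILE1; lookups are O(1) on average
    file1_contents = {tuple(line.split()) for line in file1}

with open('FILE2') as file2:
    for line in file2:
        c1, c2, c3 = line.split()
        # print the ID when FILE2's 1st and 3rd columns match a FILE1 row
        if (c1, c3) in file1_contents:
            print(c2)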

Upvotes: 3

Fabricator

Reputation: 12782

awk has associative arrays, and the first file can be loaded up front in a BEGIN block:

script.awk:

BEGIN {
    # load file1.txt up front, using "col1 col2" as associative array keys
    while (( getline < "file1.txt" ) > 0) {
        file1[$1 " " $2] = 1;
    }
}

{
    # for each line of the second file, print col2 if "col1 col3" was seen in file1
    if ($1 " " $3 in file1) {
        print $2;
    }
}

Then run

awk -f script.awk file2.txt
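
Note that the name file1.txt is hard-coded in the BEGIN block, so adjust it (and the file2.txt argument) if your files are named differently, e.g. FILE1 and FILE2 as in the question.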

Upvotes: 1
