mike_pro
mike_pro

Reputation: 115

efficient string comparison in python including numerical evaluation

I have two moderately large ascii files containing data in a fixed format. I need to test if 6 given fields in a line of the first file match (within a given tolerance) six fields on any line of the second file then output a common line to continue processing.

I am currently spliting each line in a file using a fortran style line reader, and generating a list of lists with the correct type for each element in each list. I am storing the lists of lists from both files in memory wihilst I operate on them

The fields I need to compare are all floats and I am currently using the following type of flow:

tol = 0.01
for entry1 in file1List:
    for entry2 in file2List:
        if (abs(entry1[1] - entry2[1]) < tol and abs(entry1[2] - entry2[2]) < tol 
            and abs(entry1[3] - entry2[3]) < tol and abs(entry1[4] - entry2[4]) < tol
            and abs(entry1[5] - entry2[5]) < tol and abs(entry1[6] - entry2[6]) < tol):
            print entry1,entry2

The execution of this is fine over a file containing only a small number of lines, but over 30000 lines the execution of this part alone is over 1 min!

I am fairly certain there must be a much faster comparison method but I am struggling to find it, any help would be appreciated.

Upvotes: 3

Views: 2426

Answers (4)

LSerni
LSerni

Reputation: 57408

You can sort both file1List and file2List according to the value of

entry1[1]^2 + entry1[2]^2 + ... + entry1[6]^2

These are two lists of vector metrics in six-space. Given a vector in file1List, you can find the position of vectors of the same length using the Newton-Raphson method (or also bisection: Python [already supports][1] this.

Scanning for the next values in file2List matching the next value of file1List, you will only need to scan forward, and the first value in file2List exceeding tol will mean the end of the search for that particular value of file1List.

I believe the algorithm for the original JOIN utility was more or less similar.

Upvotes: 0

mgilson
mgilson

Reputation: 309929

If you can store the elements in file1list and file2list as numpy arrays, your comparison becomes a good bit more simple (and probably faster too):

for entry1 in file1list:
    for entry2 in file2list:
        if(np.all(np.abs(entry1[1:7] - entry2[1:7]) > tol)):
            print entry1,entry2

Converting to numpy arrays is painless:

file1list = [ np.array(entry) for entry in file1list ]

This gets even a little bit better for a sorted file2list

file2list=sorted(file2list,key=operator.itemgetter(1,2,3,4,5,6))
for entry1 in file1list:
    for entry2 in file2list:
        d=np.abs(entry1[1:7]-entry2[1:7])
        if(np.all(d < tol)):
            print entry1,entry2
        elif(d[0] > tol):
            break 

Upvotes: 2

Danica
Danica

Reputation: 28846

Your current processing is O(n m), where n is the number of lines in file 1 and m the number in file 2. When n ~ m ~ 30,000, it's doing 900,000,000 comparisons: no surprise it takes a minute!

This is made a little more complicated by the fact that you're allowed to be off by a tolerance: if it were exact matching, you could just make a dictionary of the values from one of the files and then have O(m) time to build the dictionary and O(n) time to do the lookups, since hash-based dictionaries have constant-time lookups.

One possibility that's somewhat similar would be to make a dictionary of rounded values, so that things within tol are keyed by the same value, and then make sure that everything's within tol when you find a possible match. You could do that by rounding each of the entries to something slightly bigger than tol: that is, if tol is 1e-3, key based on the entries rounded to 1e-2. How effective this is depends on the distribution of your values versus tol, but it should be pretty good.

That is, do something like this (untested, but it should probably work):

 import math
 import operator

 cmp_fields = operator.itemgetter(1, 2, 3, 4, 5, 6)

 # higher-order function to get a key-getting function for a given tolerance
 def key_getter(tol):
     prec = -math.log10(tol) - 1
     def inner(entry):
         return tuple(math.round(x, prec) for x in cmp_fields(entry))
     return inner

 # make the key function we want here
 key_fn = key_getter(tol)

 # build the dictionary of entries from file2: key maps to a list of entries
 file_2_entries = collections.defaultdict(list)
 for entry in file2List:
     file_2_entries[key_fn(entry)].append(entry)

 for entry1 in file1List: # for each entry from file1...
     # check only the potential matches based on rounding from 2
     for entry2 in file_2_entries[key_fn(entry2)]:
         if all(abs(x1 - x2) < tol for x1, x2
                in zip(map(cmp_fields, (entry1, entry2)))):
             print entry1, entry2

Upvotes: 1

Andrew Clark
Andrew Clark

Reputation: 208495

The issue with your current method is that you are comparing every line in the first file to every line in the second file. To cut down on the running time you need a method to short-circuit the second for loop, with some logic like "if this iteration matches some condition, then I can break out of this loop because it is impossible that any of the subsequent entries can match".

To support this logic, you will first need to sort both lists, for example:

from operator import itemgetter
comp_fields = itemgetter(1, 2, 3, 4, 5, 6)
file1List.sort(key=comp_fields)
file2List.sort(key=comp_fields)
for entry1 in file1List:
    for entry2 in file2List:
        if (abs(entry1[1] - entry2[1]) < tol and abs(entry1[2] - entry2[2]) < tol 
            and abs(entry1[3] - entry2[3]) < tol and abs(entry1[4] - entry2[4]) < tol
            and abs(entry1[5] - entry2[5]) < tol and abs(entry1[6] - entry2[6]) < tol):
            print entry1,entry2
        elif entry2[1] - entry1[1] >= tol:
            # we know subsequent lines in file2List can't match due to ordering
            break

This is a pretty naive solution, for example one possible improvement would be to keep track of starting points, for example if on the previous iteration of file1List we went 2000 lines into file2List before finding a match, on the next iteration of file1List we can start the iteration of file2List at line 2000. It also only does the short-circuiting on the first column, and you may be able to add some additional logic to have it short-circuit on the subsequent columns as well (this gets kind of tricky due to there being a tolerance instead of requiring exact matches).

Upvotes: 1

Related Questions