Ricky Robinson
Ricky Robinson

Reputation: 22933

Python: comparing a few thousand strings. Any fast alternatives for comparison?

I have a set of around 6 000 packets which for comparison purposes I represent as strings (first 28 bytes) to compare against just as many packets, which I also represent as strings of 28 bytes.

I have to match each packet of one set with all of the other. Matchings are always unique.

I found that comparing strings takes a bit of time. Is there any way to speed up the process?

EDIT1: I wouldn't like to permutate string elements because I am always making sure that ordering between packet list and corresponding string list is preserved.

EDIT2: Here's my implementation:

list1, list2 # list of packets (no duplicates present in each list!)
listOfStrings1, listOfStrings2 # corresponding list of strings. Ordering is preserved.
alreadyMatchedlist2Indices = []
for list1Index in xrange(len(listOfStrings1)):
            stringToMatch = listOfStrings1[list1Index]
            matchinglist2Indices = [i for i, list2Str in enumerate(listOfStrings2)
                                if list2Str == stringToMatch and i not in alreadyMatchedlist2Indices]
            if not matchinglist2Indices:
                tmpUnmatched.append(list1Index)
            elif len(matchinglist2Indices) == 1:
                tmpMatched.append([list1Index, matchinglist2Indices[0]])
                alreadyMatchedlist2Indices.append(matchinglist2Indices[0])
            else:
                list2Index = matchinglist2Indices[0] #taking first matching element anyway
                tmpMatched.append([list1Index, list2Index])
                alreadyMatchedlist2Indices.append(list2Index)

Upvotes: 2

Views: 3003

Answers (3)

andmart
andmart

Reputation: 552

what's wrong with:

matches = [packet for packet in list1 if packet in list2]

Upvotes: 0

senderle
senderle

Reputation: 151157

Here's a simple linear time approach -- at least if I understand your question correctly:

>>> def get_matches(a, b):
...     reverse_map = {x:i for i, x in enumerate(b)}
...     return [(i, reverse_map[x]) for i, x in enumerate(a) if x in reverse_map]
... 
>>> get_matches(['a', 'b', 'c'], ['c', 'd', 'e'])
[(2, 0)]

This accepts two sequences of strings, a and b, and returns a list of matches represented as tuples of indices into a and b. This is O(n + m) where m and n are the lengths of a and b.

Upvotes: 4

Lewis Diamond
Lewis Diamond

Reputation: 25011

---Here I'm assuming you're taking every strings one by one and comparing to all others.---

I suggest sorting your list of string and comparing neighboring strings. This should have a runtime of O(nlogn).

Upvotes: 5

Related Questions