ItaiS
ItaiS

Reputation: 128

Using Regex to find pairs of words out of a list of words

I have a list of words such as:

l = """abc
dfg
hij
jih
gfd
cba
cbd
jip
gfe
jiw
cbw"""

I want to find pairs of words out of this list, so the first word is:

.(.)(.)

And the second word is:

\2\1.

So the \1 and the \2 refer to the characters in the first word.

The best regexp that I could come up with is:

re.findall('(^.(?P<A>.)(?P<B>.)$)(?=.*(^(?P=B)(?P=A).$))', l, re.DOTALL | re.MULTILINE)

But this search returns only some of the pairs (because findall returns only non-overlapping results...). Then I thought of using positive lookbehind assertions, but they can only be used with strings of fixed length...

Is there a way to so this with regex?

Upvotes: 4

Views: 1570

Answers (1)

Niklas B.
Niklas B.

Reputation: 95318

I doubt that regular expressions are a good way to do this (especially in Python, were you can't simply get all possible ways of matching a string like in Perl, so you'd have to call findall on all prefixes of your string). A straightforward alternative would be:

words = l.split()
pairs = set(frozenset((w1, w2)) for w1 in words for w2 in words 
                      if w1[1:] == w2[1::-1])

Results in:

>>> map(tuple, pairs)
[('hij', 'jip'), 
 ('abc', 'cbd'), 
 ('dfg', 'gfd'), 
 ('dfg', 'gfe'), 
 ('jiw', 'hij'), 
 ('hij', 'jih'), 
 ('abc', 'cbw'), 
 ('abc', 'cba')]

You could also solve this really fast by saving the prefixes of the words in a dictionary in a first pass and then build up the associations in a second pass:

from collections import defaultdict

prefixes = defaultdict(list)
for w in words:
    prefixes[w[1::-1]].append(w) 
pairs = set(frozenset((w1, w2)) for w1 in words for w2 in prefixes[w1[1:]])

The performance of this will be hard to beat by a regular expression engine.

Upvotes: 2

Related Questions