Malena Torres

Reputation: 93

How can I create a regex from a list of words?

I have a dict of words (actually I have nested dicts of verb conjugations, but that isn't relevant) and I want to make a regex by combining them.

{
  'yo': 'hablaba',
  'tú': 'hablabas',
  'él': 'hablaba',
  'nosotros': 'hablábamos',
  'vosotros': 'hablabais',
  'ellos': 'hablaban',
  'vos': 'hablabas',
}

... to make:

'habl((aba(s|is|n)?)|ábamos)' # I think that's right

If I don't include 'hablábamos' it's easy - they all share the same prefix, and I can get:

'hablaba(s|is|n)?'

... but I want a general form. Is that possible?

Upvotes: 8

Views: 3074

Answers (2)

johnsyweb

Reputation: 141810

Yes, I believe this is possible.

To get you started, this is how I would break down the problem. The sessions below are Python 2 and assume the dict from the question is bound to the name hablar.

Calculate the root by finding the longest string that is a prefix of every conjugated value:

>>> root = ''
>>> for c in hablar['yo']:
...     if all(v.startswith(root + c) for v in hablar.itervalues()):
...         root += c
...     else:
...         break
... 
>>> root
'habl'
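
As an aside, the same root can probably be found without a hand-written loop. The sketch below is Python 3 and uses os.path.commonprefix from the standard library, which compares strings character by character and is not limited to file paths. The name hablar is an assumption carried over from the session above; the data is copied from the question.

```python
import os.path

# Conjugations copied from the question; the name `hablar` is assumed.
hablar = {
    'yo': 'hablaba',
    'tú': 'hablabas',
    'él': 'hablaba',
    'nosotros': 'hablábamos',
    'vosotros': 'hablabais',
    'ellos': 'hablaban',
    'vos': 'hablabas',
}

# commonprefix works character by character on any strings.
# It needs an indexable sequence, hence the list() around the dict view.
root = os.path.commonprefix(list(hablar.values()))
print(root)  # habl
```

In Python 3 the strings are Unicode, so 'á' counts as one character rather than two bytes as in the Python 2 session.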

Whatever is left of each word after the root gives a list of endings:

>>> endings = [v[len(root):] for v in hablar.itervalues()]
>>> print endings
['abas', 'aba', 'abais', 'aba', '\xc3\xa1bamos', 'aban', 'abas']

You may then want to weed out the duplicates:

>>> unique_endings = set(endings)
>>> print unique_endings
set(['abas', 'abais', '\xc3\xa1bamos', 'aban', 'aba'])

Then join these endings together with pipes:

>>> conjoined_endings = '|'.join(unique_endings)
>>> print conjoined_endings
abas|abais|ábamos|aban|aba

Forming the regular expression is a simple matter of combining the root and the conjoined_endings string in parentheses:

>>> final_regex = '{}({})'.format(root, conjoined_endings)
>>> print final_regex
habl(abas|abais|ábamos|aban|aba)
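
The result above is a flat alternation, whereas the pattern in the question also factors out the shared 'aba'. One way to get closer to that is to build a trie of the words and emit the pattern recursively. This is a different technique from the steps above, sketched in Python 3 under the assumption that factoring common prefixes is enough. The helper names build_trie and trie_to_regex are hypothetical, and the groups are non-capturing.

```python
import re

def build_trie(words):
    """Nest dicts character by character; the key '' marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}
    return trie

def trie_to_regex(node):
    """Return a pattern matching every suffix stored below this trie node."""
    end_here = '' in node  # some word ends exactly at this node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ''
    ]
    if not branches:
        return ''
    if len(branches) == 1 and not end_here:
        return branches[0]  # no alternatives, so no group is needed
    pattern = '(?:' + '|'.join(branches) + ')'
    # If a word may also stop here, everything that follows is optional.
    return pattern + '?' if end_here else pattern

# Values copied from the question's dict.
words = ['hablaba', 'hablabas', 'hablaba', 'hablábamos',
         'hablabais', 'hablaban', 'hablabas']

pattern = trie_to_regex(build_trie(words))
print(pattern)  # on Python 3.7+: habl(?:aba(?:is|n|s)?|ábamos)
```

Duplicates collapse naturally in the trie, and re.escape keeps any regex metacharacters in the words literal. Only shared prefixes are factored; shared suffixes are not merged.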

Upvotes: 9

Vorsprung

Reputation: 34357

I think you need a less clever approach:

>>> x={
...   'yo': 'hablaba',
...   'tú': 'hablabas',
...   'él': 'hablaba',
...   'nosotros': 'hablábamos',
...   'vosotros': 'hablabais',
...   'ellos': 'hablaban',
...   'vos': 'hablabas',
... }
>>> x
{'t\xc3\xba': 'hablabas', 'yo': 'hablaba', 'vosotros': 'hablabais', '\xc3\xa9l': 'hablaba', 'nosotros': 'habl\xc3\xa1bamos', 'ellos': 'hablaban', 'vos': 'hablabas'}
>>> x.values()
['hablabas', 'hablaba', 'hablabais', 'hablaba', 'habl\xc3\xa1bamos', 'hablaban', 'hablabas']
>>> "|".join(x.values())
'hablabas|hablaba|hablabais|hablaba|habl\xc3\xa1bamos|hablaban|hablabas'

If you just join the dict's values with the alternation operator (|), the result should do what you want.
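
One caveat with a plain join: Python's re module tries alternatives left to right and takes the first that matches, so a shorter word listed before a longer one that starts with it can shadow it in an unanchored search. A minimal Python 3 sketch of the same idea, assuming you want duplicates removed, the words escaped, and longer alternatives first (the variable names are illustrative):

```python
import re

# Values copied from the question's dict.
words = ['hablaba', 'hablabas', 'hablaba', 'hablábamos',
         'hablabais', 'hablaban', 'hablabas']

# Leftmost alternative wins: 'hablaba' shadows 'hablaban' here.
shadowed = re.search('hablaba|hablaban', 'hablaban').group(0)
print(shadowed)  # hablaba

# Deduplicate, escape, and sort longest first (ties broken alphabetically).
alternatives = sorted(set(words), key=lambda w: (-len(w), w))
pattern = re.compile('|'.join(re.escape(w) for w in alternatives))

print(pattern.pattern)  # hablábamos|hablabais|hablaban|hablabas|hablaba
print(pattern.search('ellos hablaban mucho').group(0))  # hablaban
```

If the whole string must be one of the words, pattern.fullmatch avoids the ordering issue altogether.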

Upvotes: 3
