LJG

Reputation: 767

Count the n-grams that match a pattern using regular expressions

If I use:

import re
words = re.findall(r"(?u)\b\w\w+\b", "aaa, bbb ccc. ddd\naaa xxx yyy")
print(words)
print(len(words))

as expected, I get:

['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy']
7

Now I would like to modify the regular expression so that it also returns 2-grams and 3-grams, treating punctuation and newlines as boundaries that an n-gram must not cross. For this input, the result I expect is:

['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']
11

How can I modify the regular expression to be able to do this?

Upvotes: 0

Views: 93

Answers (1)

Riccardo Bucco

Reputation: 15384

Original answer

import re
from itertools import chain

s = "aaa, bbb ccc. ddd\naaa xxx yyy"
# Raw strings keep \w from being read as an (invalid) string escape.
result = list(chain(*(re.findall(r'(?=((?<!\w)\w\w\w+' + r' \w\w\w+' * n + r'(?!\w)))', s)
                      for n in range(3))))

Output:

>>> result
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']
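The capturing lookahead `(?=(...))` is what lets overlapping n-grams such as 'aaa xxx' and 'xxx yyy' both be reported. A small sketch of the difference for the 2-gram case, using the same string as above (the variable names `consuming` and `overlapping` are only illustrative):

```python
import re

s = "aaa, bbb ccc. ddd\naaa xxx yyy"

# Plain pattern: each match consumes its text, so a 2-gram that starts
# inside the previous match ('xxx yyy') is never tried.
consuming = re.findall(r'(?<!\w)\w\w\w+ \w\w\w+(?!\w)', s)

# Same pattern wrapped in a capturing lookahead: the match itself is
# zero-width, so the scan resumes one character later and still finds
# the 2-gram that begins at the next word.
overlapping = re.findall(r'(?=((?<!\w)\w\w\w+ \w\w\w+(?!\w)))', s)

print(consuming)    # ['bbb ccc', 'aaa xxx']
print(overlapping)  # ['bbb ccc', 'aaa xxx', 'xxx yyy']
```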

Improved answer (thanks to @CasimiretHippolyte for the useful comments)

import re
from itertools import chain

s = "aaa, bbb ccc. ddd\naaa xxx yyy"
# Every fragment is a raw string so that \w is passed to the regex engine unchanged.
result = list(chain(*(re.findall(r'\b(?=(\w\w\w+' + r' \w\w\w+' * n + r'))', s)
                      for n in range(3))))
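Since the question is about counting, the improved pattern can be wrapped in a small helper and fed to `collections.Counter`. This is only a sketch: the function name `ngrams` and its parameters `max_n` and `min_len` are made up for illustration. Note that the patterns above require words of at least three characters, while the regex in the question (`\w\w+`) accepts two; passing `min_len=2` should reproduce the question's word definition.

```python
import re
from collections import Counter

def ngrams(text, max_n=3, min_len=3):
    """Return all 1..max_n-grams whose words have at least min_len characters.

    Hypothetical helper: words inside an n-gram must be separated by exactly
    one space, so punctuation and newlines break an n-gram.
    """
    word = r'\w{%d,}' % min_len
    found = []
    for n in range(max_n):
        # \b anchors at a word start; the lookahead captures without consuming,
        # so overlapping n-grams are all reported.
        pattern = r'\b(?=(' + word + (' ' + word) * n + '))'
        found.extend(re.findall(pattern, text))
    return found

s = "aaa, bbb ccc. ddd\naaa xxx yyy"
result = ngrams(s)
counts = Counter(result)

print(len(result))    # 11
print(counts['aaa'])  # 2
```

`Counter` gives the frequency of each distinct n-gram, while `len(result)` gives the total of 11 that the question asks for.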

Upvotes: 2
