Count the n-grams that match a pattern using regular expressions

Question

If I use:

import re
words = re.findall(r"(?u)\b\w\w+\b", "aaa, bbb ccc. ddd\naaa xxx yyy")
print(words)
print(len(words))

as expected, I get:

['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy']
7

Now I would like to modify the regular expression so that it also counts 2-grams and 3-grams, where an n-gram must not span punctuation or a newline. In particular, the result I expect in this case is:

['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']
11

How can I modify the regular expression to be able to do this?

Riccardo Bucco · Accepted Answer

Original answer

import re
from itertools import chain

s = "aaa, bbb ccc. ddd\naaa xxx yyy"
result = list(chain(*(re.findall('(?=((?


Output:
>>> result
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']

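Improved answer (thanks to @CasimiretHippolyte for the useful comments)

import re
from itertools import chain

s = "aaa, bbb ccc. ddd\naaa xxx yyy"
# For n = 0, 1, 2 build a pattern for (n + 1)-grams: one word followed by n
# more words, each separated by a single space. The lookahead captures the
# n-gram without consuming it, so overlapping n-grams are all found.
# Note: \w\w\w+ requires at least three characters per word; use \w\w+ to keep
# the question's two-character minimum.
result = list(chain(*(re.findall(r'\b(?=(\w\w\w+' + r' \w\w\w+' * n + r'))', s)
                      for n in range(3))))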
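
To make the role of the lookahead easier to see, here is a minimal sketch that wraps the same idea in a function and then counts the results, since the question is ultimately about counting. The names find_ngrams, max_n and min_len are illustrative assumptions and not part of the answer above; the word pattern uses the question's two-character minimum.

```python
import re
from collections import Counter


def find_ngrams(text, max_n=3, min_len=2):
    """Return all 1..max_n-grams whose words are joined by a single space.

    Hypothetical helper: the name and parameters are assumptions made for
    this sketch.
    """
    # A word is min_len or more word characters.
    word = r"\w{%d,}" % min_len
    found = []
    for n in range(1, max_n + 1):
        # \b anchors the match at a word boundary. The lookahead captures
        # the n-gram without consuming it, so the next search can start
        # inside the n-gram just found and overlapping n-grams are kept.
        pattern = r"\b(?=(" + word + (" " + word) * (n - 1) + r"))"
        found.extend(re.findall(pattern, text))
    return found


s = "aaa, bbb ccc. ddd\naaa xxx yyy"

ngrams = find_ngrams(s)
print(ngrams)
print(len(ngrams))  # 11

# Count how often each n-gram occurs.
counts = Counter(ngrams)
print(counts["aaa"])  # 2

# Without the lookahead, each match consumes its text, so "xxx yyy" is lost
# because "xxx" was already used by "aaa xxx".
print(re.findall(r"\b\w{2,} \w{2,}\b", s))  # ['bbb ccc', 'aaa xxx']
```

Because only a literal space is allowed between words, a comma, a period or a newline between two words prevents them from forming an n-gram, which is what the expected output in the question requires.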
