Reputation: 767
If I use:
import re
words = re.findall(r"(?u)\b\w\w+\b", "aaa, bbb ccc. ddd\naaa xxx yyy")
print(words)
print(len(words))
as expected, I get:
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy']
7
Now I would like to modify the regular expression so that it also returns 2-grams and 3-grams, where an n-gram must not span punctuation or a newline. In particular, the result I expect in this case is:
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']
11
How can I modify the regular expression to be able to do this?
Upvotes: 0
Views: 93
Reputation: 15384
Original answer
import re
from itertools import chain
s = "aaa, bbb ccc. ddd\naaa xxx yyy"
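# One pass per n-gram size: n extra words, each preceded by a single space.
# The capture sits inside a lookahead so that overlapping n-grams are all found.
result = list(chain(*(re.findall(r'(?=((?<!\w)\w\w\w+' + r' \w\w\w+' * n + r'(?!\w)))', s)
                      for n in range(3))))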
Output:
>>> result
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']
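The capture inside the lookahead `(?=(...))` is what makes the 2-gram 'xxx yyy' show up. `re.findall` only returns non-overlapping matches, so a pattern that consumes 'aaa xxx' cannot then start a new match at 'xxx'. A small sketch of the difference for the 2-gram case (the variable names `plain` and `overlapping` are just for illustration):

```python
import re

s = "aaa, bbb ccc. ddd\naaa xxx yyy"

# Consuming match: after 'aaa xxx' is matched, scanning resumes after 'xxx',
# so 'xxx yyy' is never tried.
plain = re.findall(r'(?<!\w)\w\w\w+ \w\w\w+(?!\w)', s)

# Zero-width lookahead with a capture group: nothing is consumed, so every
# word start is tried and overlapping 2-grams are reported.
overlapping = re.findall(r'(?=((?<!\w)\w\w\w+ \w\w\w+(?!\w)))', s)

print(plain)        # ['bbb ccc', 'aaa xxx']
print(overlapping)  # ['bbb ccc', 'aaa xxx', 'xxx yyy']
```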
Improved answer (thanks to @CasimiretHippolyte for the useful comments)
import re
from itertools import chain
s = "aaa, bbb ccc. ddd\naaa xxx yyy"
# \b anchors each attempt at a word boundary; raw strings keep \w intact.
result = list(chain(*(re.findall(r'\b(?=(\w\w\w+' + r' \w\w\w+' * n + r'))', s)
                      for n in range(3))))
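If the n-gram size or the minimum word length needs to vary, the same idea can be wrapped in a small helper. This is only a sketch: the function name `ngrams` and its parameters `max_n` and `min_len` are made up for illustration, and it assumes words inside an n-gram are separated by exactly one space.

```python
import re

def ngrams(text, max_n=3, min_len=2):
    """Return all 1..max_n-grams whose words are separated by one space."""
    # A word is min_len or more word characters (min_len=2 mirrors \w\w+).
    word = r'\w{%d,}' % min_len
    found = []
    for n in range(1, max_n + 1):
        # n words joined by single spaces, captured inside a lookahead so
        # that overlapping n-grams are all reported.
        body = word + (' ' + word) * (n - 1)
        pattern = r'(?<!\w)(?=(' + body + r')(?!\w))'
        found.extend(re.findall(pattern, text))
    return found

words = ngrams("aaa, bbb ccc. ddd\naaa xxx yyy")
print(words)
print(len(words))
```

With the default `min_len=2` this follows the question's `\w\w+` definition of a word, whereas the snippets above use `\w\w\w+` and therefore skip two-letter words; pass `min_len=3` to reproduce that behaviour.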
Upvotes: 2