Reputation: 819
I have a string containing words, each tagged with its own token (e.g. NN, NNP, JJ). I want to extract the words tagged NNP, keeping consecutive NNP words together. My code so far:
import re
sentence = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"
tes = re.findall(r'(\w+)/NNP', sentence)
print(tes)
The result of the code:
['Rapunzel', 'Sheila', 'Yasir']
As we can see, three words carry the NNP tag: Rapunzel/NNP and Sheila/NNP (which appear next to each other) and Yasir/NNP (separated from the other NNP words by several words). My problem is that I need to separate the run of consecutive NNP words from the rest. My expected result is:
['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']
What is the best way to perform this task? Thanks.
Upvotes: 4
Views: 1987
Reputation: 54223
Here's an alternative without any regex. It uses groupby and split():
from itertools import groupby
string = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"
words = string.split()
def get_token(word):
    # The tag is whatever follows the last '/' in the word
    return word.split('/')[-1]
print([list(ws) for token, ws in groupby(words, get_token) if token == "NNP"])
# [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
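To see why this works, it may help to look at what groupby yields before the filter is applied. It only merges adjacent items that share the same key, so each run of neighbouring NNP words comes out as its own group. A minimal sketch on a shortened version of the sentence (the name runs is just for illustration):

```python
from itertools import groupby

words = "Rapunzel/NNP Sheila/NNP let/VBD down/RP Yasir/NNP".split()

def get_token(word):
    # The tag is whatever follows the last '/' in the word
    return word.split('/')[-1]

# Materialise every (tag, run-of-words) pair, not just the NNP ones
runs = [(token, list(ws)) for token, ws in groupby(words, get_token)]
for token, ws in runs:
    print(token, ws)
# NNP ['Rapunzel/NNP', 'Sheila/NNP']
# VBD ['let/VBD']
# RP ['down/RP']
# NNP ['Yasir/NNP']
```

The two NNP groups stay separate because other tags sit between them, which is exactly the split asked for in the question.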
Upvotes: 1
Reputation: 336158
Match the groups as simple strings, and then split them:
>>> [m.split() for m in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*", sentence)]
[['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
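One caveat, in case the tag set also includes NNPS (the Penn Treebank tag for plural proper nouns): \w+/NNP will also match the front of a word tagged NNPS. A possible variation, sketched here under that assumption, adds a negative lookahead (?!\S) so the tag must be followed by whitespace or the end of the string. The sentence2 input and the variable names are made up for illustration:

```python
import re

# Hypothetical input with an NNPS-tagged word at the front
sentence2 = "Alps/NNPS Rapunzel/NNP Sheila/NNP let/VBD Yasir/NNP"

# Original pattern: also picks up the 'Alps/NNP' prefix of 'Alps/NNPS'
loose = [m.split() for m in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*", sentence2)]

# Variation: (?!\S) requires the tag to end at whitespace or end of string
strict = [
    m.split()
    for m in re.findall(r"\w+/NNP(?!\S)(?:\s+\w+/NNP(?!\S))*", sentence2)
]

print(loose)   # [['Alps/NNP'], ['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
print(strict)  # [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
```

If the data never contains NNPS, the original pattern is enough.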
Upvotes: 4
Reputation: 785126
You can get very close to your expected outcome using a different capture group.
>>> re.findall(r'((?:\w+/NNP\s*)+)', sentence)
['Rapunzel/NNP Sheila/NNP ', 'Yasir/NNP']
Capture group ((?:\w+/NNP\s*)+) will group all the \w+/NNP patterns together with optional spaces in between.
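To go from that output to the exact nested lists in the question, one option is to split each matched run on whitespace. str.split() with no arguments also discards the trailing space left by \s*. A minimal sketch (the names runs and groups are only for illustration):

```python
import re

sentence = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"

# Each match is one run of adjacent NNP words, possibly with trailing whitespace
runs = re.findall(r'((?:\w+/NNP\s*)+)', sentence)

# Splitting on whitespace turns each run into a list and drops the trailing space
groups = [run.split() for run in runs]
print(groups)
# [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
```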
Upvotes: 3