Reputation: 819
I want to extract runs of words from a string, starting at a word tagged /IN and ending at a word tagged /NNP. My code so far (it does not work yet):
import re
sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"
tes = re.findall(r'((?:\S+/IN\s\w+/NNP\s*)+)', sentence)
print(tes)
So the sentence contains the word sequences di/IN Jln/NN Terusan/NNP Borobudur/NNP and di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP, which I would like to extract. The expected result:
['di/IN Jln/NN Terusan/NNP Borobudur/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']
What is the best way to do this? Thanks.
Upvotes: 2
Views: 2316
Reputation: 626691
Use
r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b'
See the regex demo
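As a quick check, here is a minimal sketch that applies the pattern to the sentence from the question with re.findall. The variable names (pattern, matches) are only illustrative. Because the pattern has no capturing groups, findall returns the full matched substrings.

```python
import re

# Sentence copied from the question.
sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"

# Start at a token tagged /IN, consume characters as long as no new
# .../IN token begins, then backtrack to the last token tagged /NNP.
pattern = re.compile(r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b')

matches = pattern.findall(sentence)
print(matches)
# ['di/IN Jln/NN Terusan/NNP Borobudur/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']
```

Note that the greedy + makes each match run to the last /NNP token before the next /IN token, which is why Terusan/NNP Borobudur/NNP is captured as a whole.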
Details
\S+ - 1+ non-whitespace symbols
/IN\b - a /IN substring as a whole word
(?:(?!\S+/IN\b).)+ - any 1+ chars other than line break chars that do not start a \S+/IN\b pattern sequence (use re.DOTALL to match line breaks, too)
\S+/NNP\b - 1+ non-whitespaces, then /NNP as a whole word (since \b is a word boundary)
Upvotes: 1