ytomo

Reputation: 819

How to extract a word pattern from a string using regex in Python

I want to extract runs of words from a string, each starting at a word tagged /IN and ending at a word tagged /NNP. My code so far (it does not work yet):

import re

sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"

tes = re.findall(r'((?:\S+/IN\s\w+/NNP\s*)+)', sentence)
print(tes)

The sentence contains the word sequences di/IN Jln/NN Terusan/NNP Borobudur/NNP and di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP, which I would like to extract. The expected result:

['di/IN Jln/NN Terusan/NNP Borobudur/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']

What is the best way to do this? Thanks.

Upvotes: 2

Views: 2316

Answers (1)

Wiktor Stribiżew

Reputation: 626691

Use the following pattern with re.findall:

r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b'

See the regex demo

Details

  • \S+ - 1+ non-whitespace symbols
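  • /IN\b - the literal substring /IN followed by a word boundary
  • (?:(?!\S+/IN\b).)+ - a tempered greedy token: 1+ chars other than line break chars, each consumed only if no \S+/IN\b match starts at that position, so a match cannot run past the next /IN word (use re.DOTALL to match line breaks, too)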
  • \S+/NNP\b - 1+ non-whitespaces, /NNP as a whole word (since \b is a word boundary)
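As a quick check, here is a minimal sketch that applies the pattern to the sentence from the question. The variable names (pattern, original, matches) are only illustrative. It also runs the pattern from the question, which finds nothing because it requires the /NNP word to come immediately after the /IN word.

```python
import re

sentence = (
    "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP "
    ":/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN "
    "di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"
)

# Pattern from the question: the word right after X/IN must already be
# tagged /NNP, which is never the case in this sentence.
original = re.compile(r'((?:\S+/IN\s\w+/NNP\s*)+)')
print(original.findall(sentence))
# []

# Pattern from this answer: start at a word tagged /IN, consume characters
# while no new X/IN word begins, and end at the last reachable X/NNP word.
pattern = re.compile(r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b')
matches = pattern.findall(sentence)
print(matches)
# ['di/IN Jln/NN Terusan/NNP Borobudur/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']
```

Because the tempered token is greedy, each match extends to the last /NNP word before the next /IN word, which is why both Terusan/NNP and Borobudur/NNP end up in the first result.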

Upvotes: 1
