How to limit text extraction until specific character using regex and python

Question

I have a sentence:

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

I would like to extract every word from the one tagged /IN up to the last word tagged /NNP.

The code so far extracts Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP, but I want it to stop as soon as it meets either a /: or an /IN tag. Here is the code so far:

import re

def entityExtract(text):
    # text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/(?:NNP|CDP)\b)', text)
    return text

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

extract = entityExtract(text)

print(text)
print(extract)

Output:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP']

Expected result is:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']

What is the best way to solve it?

kaza · Accepted Answer

[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b

I am as confused as @DYZ about where exactly you want to stop, so I based my regex on your expected output.
I believe you want to extract whole 'word/tag' sections of the string, with each word and its tag strongly coupled.
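Because each word and its tag travel together, the same extraction can also be done without a regex by splitting the sentence into word/tag pairs and scanning them. The following is only a sketch of that alternative, not the regex approach of this answer. The function name extract_span and its parameters are hypothetical, and it assumes every whitespace-separated token contains a "/" separating word and tag.

```python
def extract_span(text, start_tag="IN", stop_tags=("IN", ":"), end_tag="NNP"):
    """Collect spans running from a start_tag token to the last end_tag
    token seen before any stop tag (hypothetical helper, for illustration)."""
    # Assumption: every token looks like "word/tag"; split on the last "/".
    tokens = [tok.rsplit("/", 1) for tok in text.split()]
    spans = []
    i = 0
    while i < len(tokens):
        if tokens[i][1] != start_tag:
            i += 1
            continue
        j = i + 1
        last_end = None
        # Walk forward until a stop tag, remembering the last end_tag token.
        while j < len(tokens) and tokens[j][1] not in stop_tags:
            if tokens[j][1] == end_tag:
                last_end = j
            j += 1
        if last_end is not None:
            spans.append(" ".join("/".join(t) for t in tokens[i:last_end + 1]))
        i = j  # resume at the stop token, which may itself start a new span
    return spans


text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")
print(extract_span(text))
# ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```

The stop tags are plain function arguments here, so changing where the span ends does not require editing a pattern.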

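The tags at which the match should stop, without being included, are controlled by the negative lookahead group. The pattern above uses (?!IN\b|:\b); further tags can be added as alternatives, for example (?!IN\b|:\b|NN\b) to also stop before an /NN tag.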

Check regex here
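As a rough sketch of how the pattern at the top of this answer might be dropped into the question's code: the pattern contains a capturing group, so re.findall would return only that group's last captured text instead of the whole match. The sketch therefore uses re.finditer and takes group(0). The name entity_extract is just a renamed version of the question's function, and the code is written for Python 3.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r'[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b')


def entity_extract(text):
    # group(0) is the full match; re.findall would return the contents
    # of the capturing group rather than the whole matched span.
    return [m.group(0) for m in PATTERN.finditer(text)]


text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")
print(entity_extract(text))
# ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```

Alternatively, turning the group into a non-capturing one, (?:...), lets re.findall return the full matches directly.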
