ytomo

Reputation: 819

How to limit text extraction until specific character using regex and python

I have a sentence:

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

I would like to extract every word from the /IN tag up to the last word with an /NNP tag.

So far the code extracts Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP, but I want it to stop when it meets either a /: tag or another /IN tag. Here is the code so far:

import re

def entityExtract(text):
    # text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/(?:NNP|CDP)\b)', text)
    return text

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

extract = entityExtract(text)

print(text)
print(extract)

Output:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP']

Expected result is:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']

What is the best way to solve it?

Upvotes: 0

Views: 370

Answers (2)

Marc Lambrichs

Reputation: 2882

I've looked at the answer from @bulbus and the regex that @ytomo showed in the comments, which is:

[^\s/]*/IN\b[^/]*(?:/(?!IN\b|:\b)[^/]*\b)*/(?:NNP|CDP)\b

My problem is that this one - and the other proposals - do not follow a logical order in building a regex for the problem at hand. Let me show you:

The first part, before the repeating group, is [^\s/]*/IN\b[^/]*, which I'm going to simplify to \w+/IN\b[^/]*. It matches more than you want it to. Look at example 1.
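The linked example is not reproduced here, so the following is only a small sketch of what the over-matching probably looks like on the sentence from the question. The variable names are made up for illustration.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# Simplified first part of the regex: an IN group followed by [^/]*.
loose_prefix = re.compile(r"\w+/IN\b[^/]*")

# [^/]* keeps consuming characters up to the next slash, so the match
# already contains the word of the following group, without its tag.
prefix_match = loose_prefix.search(text).group(0)
print(prefix_match)  # Depan/IN SMP
```

The prefix stops in the middle of the SMP/NNP group, which is why it is cleaner to end the first part at the \b after IN and let the repeating group consume whole word/tag pairs.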

What you're solving here, in words, is:

  • read a \w+/IN group
  • followed by any number of \s[^/]+/\w+ groups that are not a \w+/IN\b group
  • for as long as you can keep reading, until
  • you've matched the last NNP or CDP group you can find.

Translate that directly to a regex and you'll come up with a more readable version. (JMHO)

  1. \w+/IN\b(\s[^/]+/[^\s]+) : read the first group after the IN group (example 2)
  2. \w+/IN\b(\s[^/]+/[^\s]+)* : repeat that second group (example 3)
  3. \w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)* : ignore :/: and \w+/IN groups (example 4)
  4. \w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(NNP|CDP)\b : make sure your last group is NNP or CDP (example 5)
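The steps above can be tried out in Python roughly like this. It is a sketch, not the code behind the linked examples: the variable names are made up, and for re.findall the groups from step 4 are rewritten as non-capturing (?:...) groups, because findall returns only the captured groups whenever a pattern contains any.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# Steps 2 and 3 exactly as written in the list above.
step2 = r"\w+/IN\b(\s[^/]+/[^\s]+)*"
step3 = r"\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*"
# Step 4, with both groups made non-capturing for use with findall.
step4 = r"\w+/IN\b(?:\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(?:NNP|CDP)\b"

# group(0) is the whole match, so capturing groups do not matter here.
step2_match = re.search(step2, text).group(0)  # runs to the end of the text
step3_match = re.search(step3, text).group(0)  # stops before the first :/:

extract = re.findall(step4, text)
print(extract)  # ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```

Step 3 still ends on pagi/NN. It is the trailing \s\w+/(?:NNP|CDP)\b in step 4 that forces the engine to backtrack to Besok/NNP.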

If we compare this one to the result @ytomo proposed in the comments of the preceding answer, there does not seem to be much difference. However, the reason I even bothered to answer is that a regex should be readable and follow some logic. Your code is going to be in production tomorrow, and - when your code breaks - someone has to check it under time pressure.

Upvotes: 1

kaza

Reputation: 2327

[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b

I am as confused as @DYZ about where you want to stop, so I based my regex on your expected output.
I believe you want to extract 'word/tag' sections of the string, and that word and tag are strongly coupled.

The tags at which the match should stop, without being included, are controlled by this group: (?!IN\b|:\b|NN\b)
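If word and tag really are one unit, the same rule can also be written without a regex by walking over the whitespace-separated tokens. This is only a sketch of that alternative, not the regex above. The function name extract_spans and its parameters are hypothetical, and the stop tags (IN and :) and end tags (NNP and CDP) are taken from the question.

```python
def extract_spans(tagged, start_tag="IN", stop_tags=("IN", ":"),
                  end_tags=("NNP", "CDP")):
    """Collect spans that start at start_tag and end at the last end tag
    seen before a stop tag (or before the end of the input)."""
    tokens = tagged.split()
    spans = []
    i = 0
    while i < len(tokens):
        # The tag is whatever follows the last slash of the token.
        if tokens[i].rpartition("/")[2] != start_tag:
            i += 1
            continue
        last_end = None
        j = i + 1
        while j < len(tokens):
            tag = tokens[j].rpartition("/")[2]
            if tag in stop_tags:
                break
            if tag in end_tags:
                last_end = j
            j += 1
        if last_end is not None:
            spans.append(" ".join(tokens[i:last_end + 1]))
        # Resume at the stop token; if it is an IN token it starts a new span.
        i = j
    return spans


text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")
print(extract_spans(text))  # ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```

The stop and end tags are plain tuples here, so changing where extraction stops means editing an argument rather than a lookahead.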

Check regex here

Upvotes: 2
