Reputation: 819
I have a sentence:
text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"
I would like to extract every word from the one tagged /IN up to the last word tagged /NNP.
The code so far extracts Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP. But I want it to stop when it meets either a /: or an /IN tag. Here is the code so far:
import re

def entityExtract(text):
    # text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/(?:NNP|CDP)\b)', text)
    return text

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"
extract = entityExtract(text)
print(text)
print(extract)
Output:
['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP']
Expected result is:
['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
What is the best way to solve it?
Upvotes: 0
Views: 370
Reputation: 2882
I've looked at the answer from @bulbus and the regex that @ytomo showed in the comments, which is:
[^\s/]*/IN\b[^/]*(?:/(?!IN\b|:\b)[^/]*\b)*/(?:NNP|CDP)\b
My problem is that this one, like the other proposals, does not follow a logical order in building a regex for the problem at hand. Let me show you:
The first part, before the repeating group, is [^\s/]*/IN\b[^/]*, which I'm going to simplify to \w+/IN\b[^/]*. It matches more than you should want it to. Look at example 1.
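A quick way to see that over-match, as a minimal sketch run against the sample sentence from the question (the variable names `with_tail` and `without_tail` are mine, not from the posts):

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# Simplified first part with the trailing [^/]* still attached:
# it runs on into the next word, up to that word's slash.
with_tail = re.search(r'\w+/IN\b[^/]*', text).group(0)

# The same part without the trailing [^/]*: only the IN-tagged word.
without_tail = re.search(r'\w+/IN\b', text).group(0)

print(repr(with_tail))     # 'Depan/IN SMP'
print(repr(without_tail))  # 'Depan/IN'
```

The trailing [^/]* swallows " SMP", so the word/tag pairs are no longer consumed one whole pair at a time.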
What you're solving here, in words, is:
Translate that directly to a regex and you'll come up with a more readable version. (JMHO)
\w+/IN\b(\s[^/]+/[^\s]+)
read the first group after the IN-group (example 2)

\w+/IN\b(\s[^/]+/[^\s]+)*
repeat that second group (example 3)

\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*
ignore :/: and \w+/IN groups (example 4)

\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(NNP|CDP)\b
make sure your last group is NNP or CDP (example 5)

If we compare this one to the regex @ytomo proposed in the comments on the preceding answer, there does not seem to be much difference. However, the reason I even bothered to answer is that a regex should be readable and follow some logic. Your code is going to be in production tomorrow and, when it breaks, someone will have to check it under time pressure.
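For what it's worth, here is a minimal sketch of the final pattern dropped into a function like the one in the question. The name `entity_extract` and the switch to non-capturing groups `(?:...)` are my own choices: with plain `(...)` groups, `re.findall` returns the captured groups instead of the whole match.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# Same logic as the last step above, written with non-capturing groups
# so that findall yields the full matched text.
PATTERN = re.compile(
    r'\w+/IN\b'                        # the IN-tagged word
    r'(?:\s[^:/]+/(?!IN|:)[^\s]+)*'    # following word/tag pairs, not IN and not :
    r'\s\w+/(?:NNP|CDP)\b'             # the last pair must be NNP or CDP
)

def entity_extract(text):
    return PATTERN.findall(text)

print(entity_extract(text))
# ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```

Splitting the pattern over three commented string literals keeps the one-to-one mapping with the steps, which is the readability point made above.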
Upvotes: 1
Reputation: 2327
[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b
I am as confused as @DYZ about where you want to stop, so I based my regex on your expected output.
I believe you want to extract 'word/tag' sections of the string, and that word+tag are strongly coupled.
The tags you want to stop at, without including them, are controlled by this group: (?!IN\b|:\b|NN\b)
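A minimal sketch of running the pattern exactly as posted at the top of this answer against the sentence from the question. Using `finditer` with `group(0)` is my own choice here: the pattern contains a capturing group, so `re.findall` would return that group's text instead of the whole match.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

pattern = re.compile(r'[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b')

# Take the full match from each match object rather than the captured group.
matches = [m.group(0) for m in pattern.finditer(text)]

print(matches)
# ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```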
Upvotes: 2