Reputation: 39
I have been trying to extract the three words before the first occurrence of a particular word. For example: Input: Kerala High Court Jurisdiction. Known word: Jurisdiction. Output: Kerala High Court
I tried the following regular expression, but it didn't work.
m = re.search("((?:\S+\s+){3,}\JURISDICTION\b\s*(?:\S+\b\s*){3,})",contents)
print(m)
Upvotes: 0
Views: 629
Reputation: 163362
About the pattern that you tried:
{3,}
repeats 3 or more times instead of exactly 3
\J
is not a valid escape sequence (Python 3.7+ raises re.error for it), so the J should not be escaped
\s*(?:\S+\b\s*){3,}
means that the repeating pattern also has to be present after matching JURISDICTION
To extract 3 words before the first occurrence, you can use re.search with a capture group instead of a lookahead.
(\S+(?:\s+\S+){2})\s+JURISDICTION\b
The pattern matches:
(
Capture group 1
\S+
Match 1+ non-whitespace chars
(?:\s+\S+){2}
Repeat 2 times matching 1+ whitespace chars and 1+ non-whitespace chars
)
Close group 1
\s+JURISDICTION\b
Match 1+ whitespace chars, then JURISDICTION followed by a word boundary
See a regex demo.
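To make the points about the original pattern concrete, here is a small sketch. The sample sentence is an assumption (the question's input with two extra leading words), chosen so that the difference between {3,} and an exact count is visible. The last part assumes Python 3.7 or later, where an unknown escape such as \J is rejected.

```python
import re

# Assumed sample input: the question's sentence with two extra leading words.
text = "In the Kerala High Court Jurisdiction"

# {3,} is greedy: it grabs every word before the keyword, not just three.
greedy = re.search(r"((?:\S+\s+){3,})JURISDICTION\b", text, re.I)
print(repr(greedy.group(1)))  # 'In the Kerala High Court '

# An exact count keeps only the three words directly before the keyword.
exact = re.search(r"(\S+(?:\s+\S+){2})\s+JURISDICTION\b", text, re.I)
print(exact.group(1))  # Kerala High Court

# \J is not a recognised escape, so compiling it fails on Python 3.7+.
try:
    re.compile(r"\JURISDICTION")
    bad_escape_rejected = False
except re.error:
    bad_escape_rejected = True
print(bad_escape_rejected)
```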
For example, using the pattern above with re.I for a case-insensitive match:
import re
pattern = r"(\S+(?:\s+\S+){2})\s+JURISDICTION\b"
s = "Kerala High Court Jurisdiction"
m = re.search(pattern, s, re.I)
if m:
    print(m.group(1))
Output
Kerala High Court
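If the known word or the number of words varies, the same idea can be wrapped in a small helper. This is only a sketch: the function name words_before and its parameters are hypothetical, and re.escape is used so that a keyword containing regex metacharacters is matched literally.

```python
import re

def words_before(text, keyword, n=3):
    """Return the n whitespace-separated words before the first
    occurrence of keyword, or None if there is no such match.
    (Hypothetical helper, not part of the original answer.)"""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One word, then n - 1 further "whitespace + word" repetitions.
    pattern = rf"(\S+(?:\s+\S+){{{n - 1}}})\s+{re.escape(keyword)}\b"
    m = re.search(pattern, text, re.I)
    return m.group(1) if m else None

print(words_before("Kerala High Court Jurisdiction", "jurisdiction"))
print(words_before("High Court Jurisdiction", "Jurisdiction"))
```

The second call returns None because only two words precede the keyword.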
Upvotes: 0
Reputation: 2681
Here are multiple ways to do so:
# Method 1
# Split the sentence into words and get the index of "Jurisdiction"
data = "Word Kerala High Court Jurisdiction"
words = data.split()
new_data = words[words.index('Jurisdiction')-3:words.index('Jurisdiction')]
print(new_data) # ['Kerala', 'High', 'Court']
# Method 2
# Split the sentence at "Jurisdiction", then split the text before it into words
data = "Word Kerala High Court Jurisdiction"
new_data = data.split('Jurisdiction')[0].split()[-3:]
print(new_data) # ['Kerala', 'High', 'Court']
# Method 3
# Using regex
import re
data = "Word Kerala High Court Jurisdiction"
new_data = re.search(r"(\w+\W+){3}(?=Jurisdiction)", data)
print(new_data.group()) # Kerala High Court
(){3}
: capturing group, repeated 3 times.
\w+
: matches a word character between one and unlimited times.
\W+
: matches any character other than a word character between one and unlimited times.
(?=)
: positive lookahead.
Jurisdiction
: matches Jurisdiction.
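One caveat with the list-slicing approach in Method 1: if fewer than three words precede the keyword, the start index goes negative and the slice comes back empty, and a missing keyword raises ValueError. A guarded variant might look like the sketch below; the helper name last_words_before is hypothetical.

```python
def last_words_before(text, keyword, n=3):
    """Return up to n words before the first exact occurrence of keyword.
    (Hypothetical helper; matching is case-sensitive, like list.index.)"""
    words = text.split()
    try:
        idx = words.index(keyword)
    except ValueError:
        return []  # keyword not present as a standalone word
    # Clamp the start at 0 so a short prefix is not read as a negative index.
    return words[max(0, idx - n):idx]

print(last_words_before("Word Kerala High Court Jurisdiction", "Jurisdiction"))
print(last_words_before("High Court Jurisdiction", "Jurisdiction"))

# For comparison, the unguarded slice on the short input is empty:
short = "High Court Jurisdiction".split()
unguarded = short[short.index("Jurisdiction") - 3:short.index("Jurisdiction")]
print(unguarded)
```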
Upvotes: 1
Reputation: 102
import re
contents = "Kerala High Court Jurisdiction"  # sample input from the question
matches = re.findall(r'(?:\b\w+\s+){3}(?=Jurisdiction)', contents, flags=re.I)
for match in matches:
    print(match)
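The expression looks for three words before the word 'Jurisdiction'. re.I makes the match case-insensitive.
Use a positive lookahead, (?=...), to check that the match is followed by a pattern without including that pattern in the result. To include the word Jurisdiction in your matches, replace (?=Jurisdiction) with plain Jurisdiction. Removing only the ?= would leave a capturing group, and re.findall would then return just that group.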
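The effect of the lookahead on what re.findall returns can be checked with a short sketch. The sample string is the one from the question; the variable names are only for illustration.

```python
import re

contents = "Kerala High Court Jurisdiction"

# Lookahead: the keyword is required but is not part of the match.
with_lookahead = re.findall(r"(?:\b\w+\s+){3}(?=Jurisdiction)", contents, flags=re.I)
print(with_lookahead)  # ['Kerala High Court ']

# Plain keyword: it becomes part of the match.
with_keyword = re.findall(r"(?:\b\w+\s+){3}Jurisdiction", contents, flags=re.I)
print(with_keyword)  # ['Kerala High Court Jurisdiction']

# Capturing group: findall returns only the group's text.
group_only = re.findall(r"(?:\b\w+\s+){3}(Jurisdiction)", contents, flags=re.I)
print(group_only)  # ['Jurisdiction']
```

Note that the lookahead version keeps the trailing whitespace, so a .strip() may be wanted before further processing.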
Upvotes: 0
Reputation: 1598
You can use re for this; the pattern could look like: ^([\w ]+)Jurisdiction
import re
s = """Kerala High Court Jurisdiction."""
print(re.findall(r"^([\w ]+)Jurisdiction", s)[0].strip().split())
# ['Kerala', 'High', 'Court']
Explanation:
re.findall(r"^([\w ]+)Jurisdiction", s) gives you ['Kerala High Court '].
[0].strip().split() takes the first element of that list, strips the surrounding whitespace, and then splits it on whitespace.
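As written, this approach returns every word between the start of the string and Jurisdiction, and [0] raises IndexError when nothing matches. A possible variant that keeps only the last three words and tolerates a missing match is sketched below; the function name three_words_before is hypothetical.

```python
import re

def three_words_before(s, keyword="Jurisdiction"):
    """Hypothetical helper: last three words of the leading run of
    word characters and spaces that precedes keyword."""
    found = re.findall(rf"^([\w ]+){re.escape(keyword)}", s)
    if not found:
        return []  # no match, so there is nothing to index
    # split() already ignores surrounding whitespace; keep the last three.
    return found[0].split()[-3:]

print(three_words_before("In the Kerala High Court Jurisdiction."))
print(three_words_before("No keyword here"))
```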
Upvotes: 0