Reputation: 39

Extracting three words before the first occurrence of a particular word

I have been trying to extract the three words before the first occurrence of a particular word. For example, Input: Kerala High Court Jurisdiction. Known Word: Jurisdiction. Output: Kerala High Court

I have tried the following regular expression, but it didn't work.

m = re.search("((?:\S+\s+){3,}\JURISDICTION\b\s*(?:\S+\b\s*){3,})",contents)
print(m)

Upvotes: 0

Answers (4)

The fourth bird

Reputation: 163362

About the pattern that you tried:

Using {3,} repeats 3 or more times instead of exactly 3
You should not escape the J: \J is not a valid escape sequence, and Python 3.7+ raises re.error (bad escape) when compiling it
The pattern ends with \s*(?:\S+\b\s*){3,} which means that the repeating pattern should be present after matching JURISDICTION
You use a capture group around the whole pattern, but instead you can capture only the part that you want, and match what should be present before (or also after it)
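The first two points can be checked directly. The following is a minimal sketch, assuming Python 3.7 or later; the sample text is made up for illustration and is not from the question.

```python
import re

# On Python 3.7+, an unknown escape such as \J makes the pattern fail to compile.
try:
    re.compile(r"\JURISDICTION\b")
    compiled = True
except re.error:
    compiled = False
print(compiled)

# Hypothetical sample text with more than three words before the known word.
text = "In the Kerala High Court Jurisdiction"

# {3,} is greedy and takes as many preceding words as it can.
at_least_three = re.search(r"(?:\S+\s+){3,}JURISDICTION\b", text, re.I)
# {3} takes exactly three preceding words.
exactly_three = re.search(r"(?:\S+\s+){3}JURISDICTION\b", text, re.I)

print(at_least_three.group())
print(exactly_three.group())
```

With {3,} the match starts at "In", while with {3} it starts at "Kerala".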

To extract 3 words before the first occurrence, you can use re.search, and use a capture group instead of a lookahead.

(\S+(?:\s+\S+){2})\s+JURISDICTION\b

The pattern matches:

( Capture group 1
- \S+ Match 1+ non whitespace chars
- (?:\s+\S+){2} Repeat 2 times matching 1+ whitespace chars and 1+ non whitespace chars
) Close group 1
\s+JURISDICTION\b Match 1+ whitespace chars, JURISDICTION followed by a word boundary

See a regex demo.

For example, using re.I for a case insensitive match:

import re

pattern = r"(\S+(?:\s+\S+){2})\s+JURISDICTION\b"
s = "Kerala High Court Jurisdiction"

m = re.search(pattern, s, re.I)

if m:
    print(m.group(1))

Output

Kerala High Court
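If the known word or the number of words varies, the same pattern can be built dynamically. This is only a sketch: the helper name words_before and its parameters are hypothetical, and it assumes n is at least 1 and that the known word ends in a word character so the trailing \b makes sense.

```python
import re

def words_before(text, word, n=3):
    """Return the n whitespace-separated words before the first match of word, or None."""
    # re.escape guards against regex metacharacters in the known word.
    pattern = rf"(\S+(?:\s+\S+){{{n - 1}}})\s+{re.escape(word)}\b"
    m = re.search(pattern, text, re.I)
    return m.group(1) if m else None

# Hypothetical input with two occurrences; re.search stops at the first one.
text = "The Kerala High Court jurisdiction and the Delhi High Court jurisdiction"
print(words_before(text, "Jurisdiction"))

# Fewer than three words before the known word: no match.
print(words_before("High Court Jurisdiction", "Jurisdiction"))
```

The second call returns None because the pattern requires three full words before the known word.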

Upvotes: 0

Cubix48

Reputation: 2681

Here are multiple ways to do so:

# Method 1
# Split the sentence into words and get the index of "Jurisdiction"
data = "Word Kerala High Court Jurisdiction"
words = data.split()
new_data = words[words.index('Jurisdiction')-3:words.index('Jurisdiction')]
print(new_data)  # ['Kerala', 'High', 'Court']

# Method 2
# Split the sentence at "Jurisdiction", then split the text before it into words
data = "Word Kerala High Court Jurisdiction"
new_data = data.split('Jurisdiction')[0].split()[-3:]
print(new_data)  # ['Kerala', 'High', 'Court']


# Method 3
# Using regex
import re

data = "Word Kerala High Court Jurisdiction"
new_data = re.search(r"(\w+\W+){3}(?=Jurisdiction)", data)
print(new_data.group())  # 'Kerala High Court ' (trailing space included; use .strip() to drop it)

(){3}: capturing group, repeated 3 times.
- \w+: matches a word character between one and unlimited times.
- \W+: matches any character different than a word character between one and unlimited times.
(?=): Positive lookahead.
Jurisdiction: Matches Jurisdiction.
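Method 1 relies on words.index, which raises ValueError when the word is absent, and its slice start goes negative when fewer than three words precede the word. A small sketch that guards both cases is shown here; the function name words_before_split is hypothetical, and it matches the word exactly, so "Jurisdiction." with trailing punctuation would not be found.

```python
def words_before_split(text, word, n=3):
    """Return up to n words preceding the first exact occurrence of word."""
    words = text.split()
    try:
        idx = words.index(word)  # first occurrence only
    except ValueError:
        return []  # the word does not occur
    # max(0, ...) keeps the slice start from wrapping around to the end of the list.
    return words[max(0, idx - n):idx]

print(words_before_split("Word Kerala High Court Jurisdiction", "Jurisdiction"))
print(words_before_split("High Court Jurisdiction", "Jurisdiction"))
print(words_before_split("no match here", "Jurisdiction"))

# For comparison, the unguarded slice from Method 1 on a short input:
short = "High Court Jurisdiction".split()
naive = short[short.index("Jurisdiction") - 3:short.index("Jurisdiction")]
print(naive)
```

On the short input the unguarded slice is words[-1:2], which is empty, while the guarded version returns the two words that are available.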

Upvotes: 1

Cubed

Reputation: 102

import re

# contents is the input text from the question
matches = re.findall(r'(?:\b\w+\s+){3}(?=Jurisdiction)', contents, flags=re.I)
for match in matches:
    print(match)

The expression looks for three words before the word 'Jurisdiction'.

re.I is to make it case insensitive.

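Use a positive lookahead (?=...) to require that the three words are followed by the pattern without including it in the match. If you want to include the word Jurisdiction in your matches, drop the whole lookahead and write Jurisdiction directly; removing only the ?= leaves a capturing group, and re.findall then returns just that group.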
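To see how re.findall behaves with these variants, here is a short sketch. The sample text is hypothetical and contains the known word twice.

```python
import re

# Hypothetical input with two occurrences of the known word.
text = "Kerala High Court Jurisdiction and Delhi High Court jurisdiction"

# Lookahead: the known word is required but not part of the match.
with_lookahead = re.findall(r"(?:\b\w+\s+){3}(?=Jurisdiction)", text, flags=re.I)
print(with_lookahead)

# Known word written directly: it becomes part of each match.
with_word = re.findall(r"(?:\b\w+\s+){3}Jurisdiction", text, flags=re.I)
print(with_word)

# Only "?=" removed: the parentheses now capture, so findall returns the group alone.
group_only = re.findall(r"(?:\b\w+\s+){3}(Jurisdiction)", text, flags=re.I)
print(group_only)

# The question asks for the first occurrence, so take the first item and strip it.
first = with_lookahead[0].strip() if with_lookahead else None
print(first)
```

Each lookahead match keeps the trailing whitespace consumed by \s+, which is why the first item is stripped at the end.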

Upvotes: 0

anotherGatsby

Reputation: 1598

You can use re for this; the pattern could look like: ^([\w ]+)Jurisdiction

import re
s = """Kerala High Court Jurisdiction."""
print(re.findall(r"^([\w ]+)Jurisdiction", s)[0].strip().split())
# ['Kerala', 'High', 'Court']

Explanation:

re.findall(r"^([\w ]+)Jurisdiction", s)

gives you ['Kerala High Court ']

[0].strip().split()

Takes the first element of the list above, strips the surrounding whitespace, and then splits it on whitespace.
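Note that [\w ]+ is greedy and captures everything from the start of the string, so with a longer input it is worth slicing the last three words, and a lazy quantifier is needed to stop at the first occurrence. A sketch with a hypothetical input that repeats the known word:

```python
import re

# Hypothetical input with two occurrences of the known word.
s = "Kerala High Court Jurisdiction and Delhi Jurisdiction"

# Greedy: the capture runs up to the LAST "Jurisdiction".
greedy = re.findall(r"^([\w ]+)Jurisdiction", s)[0].split()[-3:]
print(greedy)

# Lazy (+?): the capture stops at the FIRST "Jurisdiction".
lazy = re.findall(r"^([\w ]+?)Jurisdiction", s)[0].split()[-3:]
print(lazy)
```

The [-3:] slice keeps at most the last three words of whatever was captured.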

Upvotes: 0
