Agus Sanjaya

Reputation: 963

Extract words preceding and following search terms

Suppose I have a text like the following.

The City of New York often called New York City or simply New York is the most populous city in the United States. With an estimated population of 8537673 distributed over a land area of about 3026 square miles (784 km2) New York City is also the most densely populated major city in the United States.

I want to locate the n words preceding and following each occurrence of a search term. For example, with n=3 and the search term "New York":

1st occurrence:

2nd occurrence:

3rd occurrence:

4th occurrence:

How can I do this using regex? I found a similar question, "Extract words surrounding a search word", but it does not handle multiple occurrences of the search term.

Attempts:

import re
def search(text, n):
    # 'place' is the hard-coded search term; re.search returns only the first match
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n, 'place', word*n), text).groups()
    return groups[:n], groups[n:]

Upvotes: 1

Views: 270

Answers (2)

Mustofa Rizwan

Reputation: 10476

You may try the following:

((?:\w+\W+){3})(?=New York((?:\W+\w+){3}))

Group 1 then holds the three preceding words and group 2 the three following words.

Sample source:

import re
regex = r"((?:\w+\W+){3})(?=New York((?:\W+\w+){3}))"

test_str = "The City of New York often called New York City or simply New York is the most populous city in the United States. With an estimated 2016 population of 8537673 distributed over a land area of about 3026 square miles (784 km2) New York City is also the most densely populated major city in the United States."
matches = re.finditer(regex, test_str)

for match in matches:
    # Collapse runs of punctuation and whitespace into single spaces before printing
    print(re.sub(r'\W+', ' ', match.group(1)) + "  <------>" + re.sub(r'\W+', ' ', match.group(2)))

Regex 101 Demo
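
Since the question asks for an arbitrary n and search term, the same pattern can be built from parameters. The following is only a sketch: the function name `context_words` and the shortened sample text are assumptions made for illustration, not part of the original answer. It uses `re.escape` so that a search term containing regex metacharacters is matched literally.

```python
import re

def context_words(text, term, n):
    """Return a (before, after) pair of word lists for each occurrence of term."""
    # Same pattern as above, with the word count and the (escaped) term filled in
    pattern = r"((?:\w+\W+){%d})(?=%s((?:\W+\w+){%d}))" % (n, re.escape(term), n)
    return [
        (re.findall(r"\w+", before), re.findall(r"\w+", after))
        for before, after in re.findall(pattern, text)
    ]

# Shortened sample text (an assumption for brevity)
text = ("The City of New York often called New York City "
        "or simply New York is the most populous city")

for before, after in context_words(text, "New York", 3):
    print(before, after)
```

Note that the pattern requires exactly n words on each side, so an occurrence with fewer than n words before or after it is skipped rather than returned with a shorter context.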

Upvotes: 1

Tim Pietzcker

Reputation: 336428

You need to use a positive lookahead assertion in order to handle overlapping matches:

re.findall(r"((?:\w+\W+){3})(?=New York((?:\W+\w+){3}))", t)

Result:

[('The City of ', ' often called New'),
 ('York often called ', ' City or simply'),
 ('City or simply ', ' is the most'),
 ('miles (784 km2) ', ' City is also')]
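
To see why the lookahead matters, it may help to compare it with a pattern that consumes the following words. This is a small sketch on a shortened version of the text; the variable names `consuming` and `lookahead` are chosen here for illustration. In the consuming version, the three words after the first "New York" include the "New" of the second occurrence, so the scan resumes past it and that occurrence is lost.

```python
import re

# Shortened sample text (an assumption for brevity)
t = ("The City of New York often called New York City "
     "or simply New York is the most populous city")

# The search term and the following words are part of the match and get consumed
consuming = r"((?:\w+\W+){3})New York((?:\W+\w+){3})"
# The search term and the following words are only looked at, not consumed
lookahead = r"((?:\w+\W+){3})(?=New York((?:\W+\w+){3}))"

consumed_matches = re.findall(consuming, t)
lookahead_matches = re.findall(lookahead, t)

print(len(consumed_matches))   # 2: the second occurrence is missed
print(len(lookahead_matches))  # 3: every occurrence is found
```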

Upvotes: 1
