Ganesh_

Reputation: 589

Print words between two particular words in a given string

If the starting word is not followed by one of the ending words, that occurrence should be skipped. Here is my string:

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'

I want to print and count all the words between john and dead, death, or died. If an occurrence of john is not followed by any of dead, died, or death before the next john, skip it and start again from the next john.

My code:

import re

x = re.sub(r'[^\w]', ' ', x)  # replace dots, commas, and other special symbols with spaces

for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len(i.split()))

My output:

 got shot 
2
 with his          john got killed or 
6
 with his wife 
3

The output I want:

got shot
2
got killed or
3
with his wife
3

I don't know where I am making a mistake. This is just a sample input; I have to check 20,000 inputs at a time.

Upvotes: 3

Views: 700

Answers (2)

anubhava

Reputation: 785128

You can use this negative lookahead regex:

>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len(i.split()))
...

got shot
2
got killed or
3
with his wife
3

Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches zero or more characters but checks at every position that john does not start there. A match therefore can never run past a later john.
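To make the difference concrete, here is a small side-by-side sketch that runs both patterns on the string from the question. The capturing group around the tempered part and the variable names are only for illustration.

```python
import re

x = ('john got shot dead. john with his .... ? , john got killed or died '
     'in 1990. john with his wife dead or died')
x = re.sub(r'[^\w]', ' ', x)  # same clean-up as in the question

# Plain lazy match: free to run across a later "john".
plain = re.findall(r'(?<=john)(.*?)(?=dead|died|death)', x)
# Tempered match: each character is consumed only if "john" does not start there.
tempered = re.findall(r'(?<=john)((?:(?!john).)*?)(?=dead|died|death)', x)

plain_words = [m.split() for m in plain]
tempered_words = [m.split() for m in tempered]

print(plain_words)     # the second entry still contains 'john'
print(tempered_words)  # the unterminated "john with his" segment is skipped
```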

I also suggest using word boundaries to make it match complete words:

re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
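As a rough illustration of why the boundaries matter, the sketch below uses a made-up sentence (not from the question) in which the keywords also appear inside longer words such as johnny and deadline.

```python
import re

# Hypothetical input: "john" and "dead" also occur inside longer words.
s = 'johnny missed a deadline but john was found dead'

loose = re.compile(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)')
strict = re.compile(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)')

loose_hits = [m.strip() for m in loose.findall(s)]
strict_hits = [m.strip() for m in strict.findall(s)]

print(loose_hits)   # also picks up the fragment between "johnny" and "deadline"
print(strict_hits)  # only the segment delimited by the whole words
```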

Upvotes: 2

jbndlr

Reputation: 5210

I assume you want to start over whenever another john appears in your string before dead|died|death occurs.

In that case, you can split your string on the word john and then match against each of the resulting parts:

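import re

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
# Drop punctuation, then collapse runs of whitespace into single spaces
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    # Lazily capture everything up to the first ending word in this part
    m = re.match(r'(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))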

yields:

 got shot 
2
 got killed or 
3
 with his wife 
3

Also, note that after the replacements I propose here (before splitting and matching), the string looks like this:

john got shot dead john with his john got killed or died in 1990 john with his wife dead or died

That is, no runs of consecutive whitespace remain. Your code copes with them by splitting on whitespace later, but I feel this is a bit cleaner.
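Since the question mentions checking about 20,000 inputs, the split-and-match idea could be wrapped in a function with the patterns compiled once. This is only a sketch: the name phrases_after_john and the constants are made up for illustration, and skipping the part before the first john is an extra assumption that the loop above does not make.

```python
import re

# Compiled once so the same patterns are reused for every input string.
PUNCTUATION = re.compile(r'[^\w ]')
WHITESPACE = re.compile(r'\W+')
ENDING = re.compile(r'(.+?)(dead|died|death)')

def phrases_after_john(text):
    """Return (phrase, word_count) pairs for one input string (hypothetical helper)."""
    cleaned = WHITESPACE.sub(' ', PUNCTUATION.sub('', text)).strip()
    results = []
    # Assumption: whatever precedes the first "john" is ignored, hence [1:].
    for part in cleaned.split('john')[1:]:
        m = ENDING.match(part)
        if m:
            phrase = m.group(1).strip()
            results.append((phrase, len(phrase.split())))
    return results

x = ('john got shot dead. john with his .... ? , john got killed or died '
     'in 1990. john with his wife dead or died')
for phrase, count in phrases_after_john(x):
    print(phrase)
    print(count)
```

Calling the function once per input string keeps the per-input work down to two substitutions, one split, and one match per part.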

Upvotes: 2
