Reputation: 589
If one particular word does not end with another particular word, leave it.
Here is my string:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
I want to print and count all the words between john and dead, death, or died. If a john is not followed by any of died, dead, or death before the next john appears, skip it and start again from the next john.
My code:
import re

x = re.sub(r'[^\w]', ' ', x)  # replace dots, commas, and other special symbols with spaces
for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len([word for word in i.split()]))
My output:
got shot
2
with his john got killed or
6
with his wife
3
The output I want:
got shot
2
got killed or
3
with his wife
3
I don't know where I am making a mistake. This is just a sample input; I have to check 20,000 inputs at a time.
Upvotes: 3
Views: 700
Reputation: 785128
You can use this negative lookahead regex:
>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len([word for word in i.split()]))
...
got shot
2
got killed or
3
with his wife
3
Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches zero or more characters but consumes each one only if john does not start at that position. A match therefore can never run across another john.
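To make the difference concrete, here is a small side-by-side sketch of the two patterns on the sample string from the question. The variable names (text, plain, tempered) are only illustrative, and the whitespace in each match is collapsed so the results are easy to compare.

```python
import re

# Sample string from the question; punctuation is replaced by spaces first.
text = re.sub(r'[^\w]', ' ',
              'john got shot dead. john with his .... ? , john got killed or '
              'died in 1990. john with his wife dead or died')

# Plain lazy match: the gap is allowed to run across a later "john".
plain = re.findall(r'(?<=john)(.*?)(?=dead|died|death)', text)

# Tempered match: a character is consumed only if "john" does not start there.
tempered = re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', text)

# Collapse runs of spaces so the phrases are easy to read.
plain_phrases = [' '.join(m.split()) for m in plain]
tempered_phrases = [' '.join(m.split()) for m in tempered]

print(plain_phrases)     # ['got shot', 'with his john got killed or', 'with his wife']
print(tempered_phrases)  # ['got shot', 'got killed or', 'with his wife']
```

The second john has no dead, died, or death before the third john, so the tempered pattern gives up on it and restarts from the third one.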
I also suggest using word boundaries to make it match complete words:
re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
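Since the question mentions checking 20,000 inputs, one possible way to package the word-boundary version is to compile the pattern once and reuse it. This is only a sketch: PATTERN and phrases_with_counts are hypothetical names, not part of the original answer, and the cleanup step is the same re.sub call used in the question.

```python
import re

# Compiled once so it can be reused across many input strings.
PATTERN = re.compile(
    r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)'
)

def phrases_with_counts(text):
    """Return (phrase, word_count) pairs for each john ... dead/died/death span."""
    # Same cleanup as in the question: non-word characters become spaces.
    cleaned = re.sub(r'[^\w]', ' ', text)
    results = []
    for match in PATTERN.findall(cleaned):
        words = match.split()
        results.append((' '.join(words), len(words)))
    return results

sample = ('john got shot dead. john with his .... ? , john got killed or '
          'died in 1990. john with his wife dead or died')
print(phrases_with_counts(sample))
# [('got shot', 2), ('got killed or', 3), ('with his wife', 3)]
```

With the word boundaries in place, a name such as johnny no longer counts as john, so 'johnny is dead. john was shot dead' yields only ('was shot', 2).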
Upvotes: 2
Reputation: 5210
I assume you want to start over whenever another john appears in your string before dead|died|death occurs.
In that case, you can split your string on the word john and then match against each of the resulting parts:
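import re

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
# Drop punctuation, then collapse runs of spaces into single spaces.
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    # Lazily capture everything up to the first dead/died/death in this part.
    m = re.match(r'(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))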
yields:
got shot
2
got killed or
3
with his wife
3
Also, note that after the replacements I propose here (before splitting and matching), the string looks like this:
john got shot dead john with his john got killed or died in 1990 john with his wife dead or died
That is, no runs of multiple whitespace characters are left. You handle this by splitting on whitespace later, but I feel this is a bit cleaner.
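For reference, the two substitutions and the split can be inspected step by step. The intermediate names below (raw, no_punct, normalized, segments) are only illustrative and do not appear in the code above.

```python
import re

raw = ('john got shot dead. john with his .... ? , john got killed or '
       'died in 1990. john with his wife dead or died')

# Step 1: drop everything that is neither a word character nor a space.
no_punct = re.sub(r'[^\w ]', '', raw)
# Step 2: only word characters and spaces remain, so \W+ matches runs of
# spaces; collapse each run into one space and trim the ends.
normalized = re.sub(r'\W+', ' ', no_punct).strip()

# Each segment is the text that follows one occurrence of "john".
segments = normalized.split('john')

print(normalized)
print(segments)
```

The first segment is empty because the string starts with john. The segment ' with his ' contains none of the end words, so re.match returns None for it and it is skipped.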
Upvotes: 2