Reputation: 890
With the following code (a bit messy, I acknowledge) I split a string on commas, on the condition that it does not split where the commas only separate single words. For example:
It doesn't separate "Yup, there's a reason why you want to hit the sack just minutes after climax"
but it separates "The increase in heart rate, which you get from masturbating, is directly beneficial to the circulation, and can reduce the likelihood of a heart attack"
into ['The increase in heart rate', 'which you get from masturbating', 'is directly beneficial to the circulation', 'and can reduce the likelihood of a heart attack']
The problem is that the code fails at its purpose when it encounters a string such as: "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow."
I don't want a separation after oxytocin, but after prolactin. I need a regex to do that.
import os
import textwrap
import re
import io
from textblob import TextBlob

string = str(input_string)  # input_string holds the text to split (defined elsewhere)
listy= [x.strip() for x in string.split(',')]
listy = [x.replace('\n', '') for x in listy]
# Drop periods that are neither preceded nor followed by a digit
listy = [re.sub(r'(?<!\d)\.(?!\d)', '', x) for x in listy]
listy = list(filter(None, listy))  # Remove any empty strings; list() keeps it indexable
newstring= []
for segment in listy:
    wc = TextBlob(segment).word_counts
    if listy[len(listy)-1] != segment:
        if len(wc) > 3:  # len(segment.split(' ')) > 7:
            newstring.append(segment+"&&")  # long segment: mark a split point
        else:
            newstring.append(segment+",")  # short segment: keep the comma
    else:
        newstring.append(segment)
sep = [x.strip() for x in (' '.join(newstring)).split('&&')]
Upvotes: 0
Views: 582
Reputation: 2327
Consider the following:
import re
mystr="When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow."
rExp=r",(?!\s+(?:and\s+)?\w+,)"
mylst=re.compile(rExp).split(mystr)
print(mylst)
This should give the following output:
['When men ejaculate', ' it releases a slew of chemicals including oxytocin, vasopressin, and prolactin', ' all of which naturally help you hit the pillow.']
Let's look at how we split the string...
,(?!\s+\w+,)
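This uses every comma that is not followed by whitespace and a single word with a comma after it: (?! starts a negative lookahead, and \s+\w+, is the whitespace, word and comma it looks for.
On its own, the above would still split after "vasopressin", because the word that follows it, "and", is not itself followed by a comma. So introduce an optional and\s+ inside the lookahead: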
,(?!\s+(?:and\s+)?\w+,)
Although I might prefer the pattern below, which also allows "or" before the last word of the list:
,(?!\s+(?:(?:and|or)\s+)?\w+,)
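To see how the three patterns differ, here is a small sketch that runs each one. The variable names and the "or" sentence are made up for illustration and are not from the question.

```python
import re

sentence = ("When men ejaculate, it releases a slew of chemicals including "
            "oxytocin, vasopressin, and prolactin, all of which naturally "
            "help you hit the pillow.")
# Hypothetical sentence, used only to exercise the "or" variant
or_sentence = "Pick tea, coffee, or juice, whichever you like best."

basic = re.compile(r",(?!\s+\w+,)")
with_and = re.compile(r",(?!\s+(?:and\s+)?\w+,)")
with_and_or = re.compile(r",(?!\s+(?:(?:and|or)\s+)?\w+,)")

# The basic pattern still splits after "vasopressin"
basic_parts = [p.strip() for p in basic.split(sentence)]
# Allowing an optional "and" keeps the whole list together
and_parts = [p.strip() for p in with_and.split(sentence)]
# The "and"-only pattern breaks a list that ends in "or ..."
and_on_or = [p.strip() for p in with_and.split(or_sentence)]
# The "and|or" pattern keeps that list together as well
or_parts = [p.strip() for p in with_and_or.split(or_sentence)]

for parts in (basic_parts, and_parts, and_on_or, or_parts):
    print(len(parts), parts)
```

The first and third results have one piece more than wanted, which is the case each extension of the lookahead is meant to fix.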
In essence consider replacing your line
listy= [x.strip() for x in string.split(',')]
with
listy= [x.strip() for x in re.split(r",(?!\s+(?:and\s+)?\w+,)",string)]
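As a rough sketch, the same split can be wrapped in a small function and checked against the sentences from the question. The names split_clauses and CLAUSE_COMMA are hypothetical; the pattern is the one given above.

```python
import re

# Hypothetical name; the pattern itself is the one from this answer
CLAUSE_COMMA = re.compile(r",(?!\s+(?:and\s+)?\w+,)")

def split_clauses(text):
    """Split on commas that are not followed by a single word and a comma."""
    parts = (part.strip() for part in CLAUSE_COMMA.split(text))
    return [part for part in parts if part]  # drop empty pieces

heart = ("The increase in heart rate, which you get from masturbating, "
         "is directly beneficial to the circulation, and can reduce the "
         "likelihood of a heart attack")
print(split_clauses(heart))
print(split_clauses("Yup, there's a reason why you want to hit the sack"))
```

Note that the pattern by itself still splits after a lone leading word such as "Yup"; the word-count check in the loop from the question is what joins such short segments back together, so that loop should stay in place after the line is replaced.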
Upvotes: 1