Reputation: 349
Good morning,
I found multiple threads dealing with splitting strings with multiple delimiters, but not with one delimiter and multiple conditions.
I want to split the following string into sentences:
desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
If I do:
desc.split('. ')
I get:
['Dr', 'Anna Pytlik is an expert in conservative and aesthetic dentistry', 'She speaks both English and Polish.']
I don't want to split at the first dot, the one after 'Dr'. How can I give a list of substrings after which .split('. ') should not apply?
Thank you!
Upvotes: 1
Views: 262
Reputation: 82899
You could use re.split with a negative lookbehind:
>>> import re
>>> desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
>>> re.split(r"(?<!Dr|Mr)\. ", desc)
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
'She speaks both English and Polish.']
Just add more "exceptions", delimited with |.
Update: In Python's re module a lookbehind must have a fixed width, so all the alternatives need the same length. The pattern above therefore does not work with both "Dr." and "Prof." (compiling it raises "look-behind requires fixed-width pattern"). One workaround might be to pad the shorter alternatives with ., e.g. (?<!..Dr|..Mr|Prof). You could easily write a helper function to pad each title with as many . as needed. However, this breaks if the very first word of the text is Dr., because there are no characters before it for the .. to match, so the text is still split there.
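A minimal sketch of such a padding helper, in case it is useful. The name title_splitter and the default list of titles are made up for illustration, and it keeps the limitation described above (a title at the very start of the text is still split off):

```python
import re

def title_splitter(titles=("Dr", "Mr", "Prof")):
    """Compile a '. ' splitter that skips dots following one of the titles.

    Hypothetical helper: shorter titles are left-padded with '.' (any
    character) so that every lookbehind alternative has the same width.
    """
    width = max(len(t) for t in titles)
    alternatives = "|".join("." * (width - len(t)) + re.escape(t) for t in titles)
    # For the defaults this builds (?<!..Dr|..Mr|Prof)\. 
    return re.compile(r"(?<!%s)\. " % alternatives)

splitter = title_splitter()
text = "Ask Dr. Pytlik about dentistry. Prof. Miller speaks Polish."
print(splitter.split(text))
# ['Ask Dr. Pytlik about dentistry', 'Prof. Miller speaks Polish.']
```

If the padding feels too fragile, chaining one lookbehind per title, e.g. (?<!Dr)(?<!Prof)\. , should also satisfy the fixed-width rule, since each lookbehind is checked on its own.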
Another workaround might be to first replace all the titles with some placeholders, e.g. "Dr." -> "{DR}" and "Prof." -> "{PROF}", then split, then swap the original titles back in. This way you don't even need regular expressions.
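pairs = (("Dr.", "{DR}"), ("Prof.", "{PROF}"))  # and some more

def subst_titles(s, reverse=False):
    # Swap titles for placeholders, or placeholders back to titles if reverse is True.
    for title, placeholder in pairs:
        old, new = (placeholder, title) if reverse else (title, placeholder)
        s = s.replace(old, new)
    return s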
Example:
>>> text = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. Prof. Miller speaks both English and Polish."
>>> [subst_titles(s, True) for s in subst_titles(text).split(". ")]
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry', 'Prof. Miller speaks both English and Polish.']
Upvotes: 2
Reputation: 16
You could split on '. ' and then re-join the pieces that end with Dr/Mr/... This doesn't need complicated regexes and could be faster (you should benchmark it to choose the best option).
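A rough sketch of that idea, assuming a piece should be glued to the next one whenever its last word is a known title. The function name split_sentences and the default titles are just for illustration:

```python
def split_sentences(text, titles=("Dr", "Mr", "Prof")):
    """Split on '. ', then re-join pieces that were cut right after a title."""
    sentences = []
    glue = False  # True when the previous piece ended with a title
    for part in text.split(". "):
        if glue:
            # Put back the '. ' that split() removed after the title.
            sentences[-1] += ". " + part
        else:
            sentences.append(part)
        words = part.split()
        glue = bool(words) and words[-1] in titles
    return sentences

desc = ("Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. "
        "She speaks both English and Polish.")
print(split_sentences(desc))
# ['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
#  'She speaks both English and Polish.']
```

Unlike the lookbehind version, this also handles a title at the very start of the text, but a sentence that really ends with one of the listed words would be merged with the next one.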
Upvotes: 0