fassn

Reputation: 349

Split a string with one delimiter but multiple conditions

Good morning,

I found multiple threads dealing with splitting strings with multiple delimiters, but not with one delimiter and multiple conditions.

I want to split the following string into sentences:

desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."

If I do:

desc.split('. ')

I get:

['Dr', 'Anna Pytlik is an expert in conservative and aesthetic dentistry', 'She speaks both English and Polish.']

I don't want to split at the dot after 'Dr'. How can I specify a list of substrings after which .split('. ') should not apply?

Thank you!

Upvotes: 1

Views: 262

Answers (2)

tobias_k

Reputation: 82899

You could use re.split with a negative lookbehind:

>>> import re
>>> desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
>>> re.split(r"(?<!Dr|Mr)\. ", desc)
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
 'She speaks both English and Polish.']

Just add more "exceptions", delimited with |.


Update: Python's re module requires all alternatives in a lookbehind to have the same length, so this does not work with both "Dr." and "Prof." One workaround is to pad the shorter alternatives with ., e.g. (?<!..Dr|..Mr|Prof). A small helper function could pad each title with as many . as needed. However, this breaks if the very first word of the text is Dr., because there are no characters before it for the .. to match.
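The padding helper mentioned above could look roughly like this. It is only a sketch: the names TITLES, padded_lookbehind and chained_lookbehind are made up for illustration, and the title list is an assumption. The second function is a variant that is not part of the answer above: it chains one lookbehind per title, which sidesteps the equal-length rule without any padding.

```python
import re

TITLES = ["Dr", "Mr", "Prof"]  # illustrative list, extend as needed

def padded_lookbehind(titles):
    """Build one lookbehind whose alternatives are padded to equal width."""
    width = max(len(t) for t in titles)
    # "." matches any character, so "..Dr" is as wide as "Prof".
    alternatives = ("." * (width - len(t)) + re.escape(t) for t in titles)
    return re.compile(r"(?<!" + "|".join(alternatives) + r")\. ")

def chained_lookbehind(titles):
    """Build one fixed-width lookbehind per title instead of padding."""
    guards = "".join("(?<!" + re.escape(t) + ")" for t in titles)
    return re.compile(guards + r"\. ")

text = "I met Dr. Smith. Prof. Miller was there too."
padded_result = padded_lookbehind(TITLES).split(text)
chained_result = chained_lookbehind(TITLES).split(text)

# With the padded pattern, a title at the very start is still split off,
# because there are no characters before it for the padding to match.
start_padded = padded_lookbehind(TITLES).split("Dr. Anna is here. She left.")
start_chained = chained_lookbehind(TITLES).split("Dr. Anna is here. She left.")
```

Both patterns split the sample text into 'I met Dr. Smith' and 'Prof. Miller was there too.'. Only the chained version keeps a leading "Dr." attached to its sentence.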

Another workaround might be to first replace all the titles with some placeholders, e.g. "Dr." -> "{DR}" and "Prof." -> "{PROF}", then split, then swap the original titles back in. This way you don't even need regular expressions.

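# Map each title to a placeholder that does not contain ". "
pairs = (("Dr.", "{DR}"), ("Prof.", "{PROF}"))  # and some more

def subst_titles(s, reverse=False):
    """Swap titles for placeholders, or back again if reverse is True."""
    for title, placeholder in pairs:
        old, new = (placeholder, title) if reverse else (title, placeholder)
        s = s.replace(old, new)
    return s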

Example:

>>> text = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. Prof. Miller speaks both English and Polish."
>>> [subst_titles(s, True) for s in subst_titles(text).split(". ")]
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry', 'Prof. Miller speaks both English and Polish.']

Upvotes: 2

O. Nurtdinov

Reputation: 16

You could split on '. ' and then rejoin any piece that ends in Dr/Mr/... with the piece that follows it. This doesn't need complicated regexes and could be faster (you should benchmark it to choose the best option).
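A minimal sketch of this split-then-rejoin idea, assuming a hand-maintained set of titles. The names TITLES and split_sentences are hypothetical and not from the answer; no benchmark has been run on it.

```python
TITLES = {"Dr", "Mr", "Prof"}  # assumed list of titles, extend as needed

def split_sentences(text, titles=TITLES, sep=". "):
    """Split on sep, then glue back pieces that were cut right after a title."""
    sentences = []
    pending = []  # pieces of the sentence currently being rebuilt
    for piece in text.split(sep):
        pending.append(piece)
        last_word = piece.rsplit(" ", 1)[-1]
        if last_word not in titles:
            # The piece ends a real sentence: emit everything collected so far.
            sentences.append(sep.join(pending))
            pending = []
    if pending:
        # The text ended on a title; keep the leftover pieces as they are.
        sentences.append(sep.join(pending))
    return sentences

desc = ("Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. "
        "She speaks both English and Polish.")
result = split_sentences(desc)
```

For the string from the question, result contains the two sentences with "Dr." still attached to the first one.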

Upvotes: 0
