id345678

Reputation: 107

Split strings that contain more than one substring

I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split the strings that contain more than one of the following substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows the substring:

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the 'more than one' condition in my code:

import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    # start a new group whenever the string contains one of the substrings;
    # this only groups whole strings, it does not split them yet
    if substrings.search(i):
        splitted.append([])
    splitted[-1].append(i)

Exception: when that last character is a period (e.g. Prof.), split after the second word following the substring.


Update: names is more complex than I thought. A string follows

  1. the title-like pattern already answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose'),
  2. until a second pattern of strings follows ('Mister Kelly, AWS'),
  3. until a third pattern of strings follows and runs to the end ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary').

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

Sometimes Secretary is followed by varying specifications. I don't care about the characters that sometimes follow Secretary before the next name occurs; they can be dropped. Of course, 'Secretary' itself should be stored as in updated_output.

I created a (hopefully exhaustive) list, specifications, of what can follow Secretary: specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I account for the third pattern using the specifications list?

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']

Upvotes: 1

Views: 304

Answers (2)

pho

Reputation: 25489

You want to split at the word boundary just before one of those three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of those titles:

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
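To see why the lookahead matters, here is a small comparison sketch (the variable names `consumed` and `kept` are only illustrative). A pattern that consumes the title removes it from the pieces, while a zero-width lookahead splits just before it and leaves it in place. Splitting on a pattern that can match the empty string needs Python 3.7 or later.

```python
import re

s = "Vice president Johnson affiliate Peterson"

# Consuming pattern: the matched title is removed from the pieces.
consumed = re.split(r"affiliate", s)
print(consumed)  # ['Vice president Johnson ', ' Peterson']

# Zero-width lookahead: the split happens just before the title,
# so the title stays at the start of the following piece.
kept = re.split(r"\b(?=affiliate)", s)
print(kept)  # ['Vice president Johnson ', 'affiliate Peterson']
```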

Then, you can trim and discard the empty results:

>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']

With a list of input strings, simply apply this treatment to all of them:

import re

def get_names(s):
    # split just before each title, then strip and drop empty pieces
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    return [x for i in v if (x := i.strip())]


names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

output = []
for n in names:
    output.extend(get_names(n))

Which gives:

output = ['Acquaintance Muller',
 'Vice president Johnson',
 'Affiliate Peterson',
 'Acquaintance Dr. Rose']
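
The pattern above hard-codes the three titles. A minimal sketch that builds it from the `substrings` list in the question instead, assuming the titles are plain text (hence `re.escape`); `build_splitter` and `split_names` are illustrative names, not part of the original post:

```python
import re

def build_splitter(substrings):
    # Escape each title so it is matched literally, then wrap the
    # alternation in a lookahead so the split itself is zero-width.
    alternation = "|".join(map(re.escape, substrings))
    return re.compile(rf"\b(?={alternation})", flags=re.I)

def split_names(names, substrings):
    splitter = build_splitter(substrings)
    out = []
    for name in names:
        # Strip each piece and drop the empty one produced by a match at index 0.
        out.extend(p.strip() for p in splitter.split(name) if p.strip())
    return out

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = ['Vice president', 'Affiliate', 'Acquaintance']
result = split_names(names, substrings)
print(result)
```

For the two sample strings this should reproduce desired_output from the question. Adding a title then only means extending the list.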

Upvotes: 2

Andrej Kesely

Reputation: 195438

Try:

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    starts = [i.start() for i in r.finditer(n)]

    if not starts:
        out.append(n)
        continue

    if starts[0] != 0:
        starts = [0, *starts]

    starts.append(len(n))
    for a, b in zip(starts, starts[1:]):
        out.append(n[a:b])

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
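
For the updated input in the question, the same start-position idea might be extended as sketched below. This is only a sketch under assumptions: the personal prefixes (`Mister`, `Miss`, `Dr.`) are inferred from the sample data, matching is case-sensitive, and `split_entries` with its `drop_specifications` flag is a hypothetical helper. A prefix starts a new entry unless it directly follows a title, which is checked with one fixed-width negative lookbehind per title.

```python
import re

titles = ["Vice president", "Affiliate", "Acquaintance"]
# Assumption: these personal prefixes are inferred from the sample data only.
prefixes = ["Mister", "Miss", "Dr."]
specifications = [" of State", " for Relations", " for the Interior", " for the Environment"]

title_alt = "|".join(map(re.escape, titles))
prefix_alt = "|".join(map(re.escape, prefixes))
# One fixed-width negative lookbehind per title: a prefix that directly
# follows a title (e.g. "Acquaintance Dr. Rose") does not start a new entry.
guards = "".join(rf"(?<!{re.escape(t)} )" for t in titles)
entry_start = re.compile(rf"\b(?:{title_alt}|{guards}(?:{prefix_alt}))(?=\s)")

# Matches a specification at the very end of a piece, right after "Secretary".
spec_alt = "|".join(map(re.escape, specifications))
trailing_spec = re.compile(rf"(?<=Secretary)(?:{spec_alt})$")

def split_entries(name, drop_specifications=False):
    starts = [m.start() for m in entry_start.finditer(name)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text before the first recognised entry
    starts.append(len(name))
    pieces = [name[a:b].strip() for a, b in zip(starts, starts[1:])]
    if drop_specifications:
        pieces = [trailing_spec.sub("", p) for p in pieces]
    return [p for p in pieces if p]

third = ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, '
         'Secretary for Relations Dr. Jakob, Secretary')
print(split_entries(third))
print(split_entries(third, drop_specifications=True))
```

With `drop_specifications=True` the text after Secretary is removed from each entry; by default it is kept as it appears in the input.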

Upvotes: 1
