Reputation: 107
I have a list of strings, names:
names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
I want to split the strings that contain more than one of the following substrings:
substrings = ['Vice president', 'Affiliate', 'Acquaintance']
More precisely, I want to split after the last character of the word that follows the substring:
desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']
I don't know how to implement the 'more than one' condition in my code:
import re
names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    if substrings in i:
        splitted.append([])
    splitted[-1].append(item)
Exception: when that last character is a period (e.g. Prof.), split after the second word following the substring.
Update: names is more complex than I thought and follows these patterns:
1) 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose'
2) 'Mister Kelly, AWS'
3) 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary'
names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']
Sometimes Secretary is followed by varying specifications. I don't care about the characters that sometimes follow Secretary until the next name occurs. They can be dropped. Of course, 'Secretary' itself should be stored as in updated_output.
I created a (hopefully exhaustive) list, specifications, of the text that can follow Secretary. Here is a representation of the list:
specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']
Updated question: how can I account for the third pattern using the specifications list?
updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']
Upvotes: 1
Views: 304
Reputation: 25489
You want to split at the word boundary just before one of those three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of those titles:
>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
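To see why the lookahead matters, here is a small sketch comparing a plain split with a lookahead split on a shortened example string. It assumes Python 3.7 or later, where re.split can split on zero-width matches; the variable names are illustrative only.

```python
import re

s = 'Vice president Johnson affiliate Peterson'

# A plain pattern consumes the delimiter, so the title is lost.
consumed = re.split(r"affiliate", s)

# A lookahead matches the position before the title without consuming it,
# so the title stays attached to the piece that follows.
kept = re.split(r"\b(?=affiliate)", s)

print(consumed)  # ['Vice president Johnson ', ' Peterson']
print(kept)      # ['Vice president Johnson ', 'affiliate Peterson']
```

The trailing spaces remain in both cases, which is why the pieces still need trimming.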
Then, you can trim and discard the empty results:
>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
With a list of input strings, simply apply this treatment to all of them:
import re

def get_names(s):
    # Split just before each title, then drop empty pieces and surrounding whitespace.
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    return [x for i in v if (x := i.strip())]

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
output = []
for n in names:
    output.extend(get_names(n))
Which gives:
output = ['Acquaintance Muller',
          'Vice president Johnson',
          'Affiliate Peterson',
          'Acquaintance Dr. Rose']
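If the titles live in a list, as with substrings in the question, the pattern can be built from that list rather than hard-coded. The following is a minimal sketch of that idea; make_splitter and split_names are hypothetical helper names, and it assumes Python 3.7 or later for splitting on zero-width matches.

```python
import re

def make_splitter(titles):
    # Hypothetical helper: escape each title and try longer titles first,
    # so a short title cannot shadow a longer one that starts the same way.
    alternation = "|".join(
        re.escape(t) for t in sorted(titles, key=len, reverse=True)
    )
    return re.compile(rf"\b(?={alternation})", flags=re.IGNORECASE)

def split_names(names, titles):
    splitter = make_splitter(titles)
    result = []
    for name in names:
        # Split before each title, then drop empty pieces and stray whitespace.
        result.extend(p.strip() for p in splitter.split(name) if p.strip())
    return result

substrings = ['Vice president', 'Affiliate', 'Acquaintance']
names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
print(split_names(names, substrings))
```

Using re.escape keeps titles that contain regex metacharacters (for example a trailing period) from being interpreted as pattern syntax.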
Upvotes: 2
Reputation: 195438
Try:
import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

# Match any of the substrings literally (case-sensitive).
r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    # Start index of every substring found in this name.
    starts = [i.start() for i in r.finditer(n)]
    if not starts:
        out.append(n)
        continue
    # Keep any text that precedes the first substring.
    if starts[0] != 0:
        starts = [0, *starts]
    starts.append(len(n))
    # Slice the name between consecutive start positions.
    for a, b in zip(starts, starts[1:]):
        out.append(n[a:b])

print(out)
Prints:
['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
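Neither snippet above covers the patterns added in the question's update. One possible direction, as a rough sketch only, is to match whole entries with finditer instead of splitting. It rests on assumptions not stated in the question: the honorifics list is inferred from the sample data, names and roles are single words, and a specification is kept when it is present in the input. alternation and extract_entries are hypothetical helper names.

```python
import re

titles = ["Vice president", "Affiliate", "Acquaintance"]
honorifics = ["Mister", "Miss", "Dr."]  # assumption: inferred from the sample data
specifications = ["", " of State", " for Relations",
                  " for the Interior", " for the Environment"]

def alternation(options):
    # Hypothetical helper: escape each option and try longer ones first.
    return "|".join(re.escape(o) for o in sorted(options, key=len, reverse=True))

title_alt = alternation(titles)
honorific_alt = alternation(honorifics)
spec_alt = alternation(s for s in specifications if s)  # skip the empty entry

entry = re.compile(
    # Pattern 1: title, optional abbreviation ending in a period, then one word.
    rf"(?:{title_alt})\s+(?:\w+\.\s+)?\w+"
    # Patterns 2 and 3: honorific, name, comma, role, optional specification.
    rf"|(?:{honorific_alt})\s+\w+,\s+\w+(?:{spec_alt})?"
)

def extract_entries(names):
    return [m.group(0) for n in names for m in entry.finditer(n)]

names = [
    'Acquaintance Muller',
    'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
    'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
    'Mister Kelly, AWS',
    'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary',
]
print(extract_entries(names))
```

Because the title alternative is tried at the earliest position in the string, 'Acquaintance Dr. Rose' is consumed as one entry before the honorific alternative can claim 'Dr.' on its own. Text that matches neither alternative is silently skipped, so real data may need a check for leftovers.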
Upvotes: 1