efsee

Reputation: 629

Iterate and match all elements with regex

So I have something like this:

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

I want to replace every element with the first name so it would look like this:

data = ['Alice Smith', 'Tim', 'Uncle Neo']

So far I got:

for i in range(len(data)):
    if re.match('(.*) and|with|\&', data[i]):
        a = re.match('(.*) and|with|\&', data[i])
        data[i] = a.group(1)

But it doesn't seem to work. I think it's because of my pattern, but I can't figure out the right way to do this.

Upvotes: 1

Views: 1095

Answers (5)

ctwheels

Reputation: 22817

Brief

I would suggest using Casimir's answer if possible. If you are not sure which word might follow the name (that is, if and, with, and & are dynamic), you can use this regex instead.

Note: This regex will not work for some special cases, such as names containing apostrophes ' or dashes -, but you can add those to the character classes you're searching for. This answer also depends on each name word beginning with an uppercase character and on the "union word", as I'll call it (and, with, &, etc.), not beginning with an uppercase character.


Code


Regex

^((?:[A-Z][a-z]*\s*)+)\s.*

Substitution

$1

Result

Input

Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31

Output

Alice Smith
Tim
Uncle Neo

Explanation

  • Assert position at the beginning of the string ^
  • Match a capital alpha character [A-Z]
  • Match any number of lowercase alpha characters [a-z]*
  • Match any number of whitespace characters \s* (use a literal space followed by * instead if you'd prefer to match spaces only)
  • Match the above conditions between one and unlimited times, all captured into capture group 1 (...)+: where ... contains everything above
  • Match a whitespace character \s, followed by any character (except new line) any number of times .*
  • $1: Replace with capture group 1
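
To apply the substitution above from Python, something like the following should work. This is a minimal sketch using the question's data list; the helper name first_names is only illustrative, and note that Python's re module writes the backreference as \1 (or \g<1>) rather than $1.

```python
import re

# Pattern from this answer: capitalized words at the start, then the rest.
PATTERN = re.compile(r'^((?:[A-Z][a-z]*\s*)+)\s.*')

def first_names(items):
    # Hypothetical helper: keep only capture group 1 of each string.
    # Strings that do not match the pattern are returned unchanged.
    return [PATTERN.sub(r'\1', item) for item in items]

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
print(first_names(data))  # ['Alice Smith', 'Tim', 'Uncle Neo']
```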

Upvotes: 0

RomanPerekhrest

Reputation: 92854

Simplify your approach to the following:

import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]

print(data)

The output:

['Alice Smith', 'Tim', 'Uncle Neo']

  • .*(?= (and|with|&)) - positive lookahead assertion, ensures that name/surname .* is followed by any item from the alternation group (and|with|&)
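
One caveat with the one-liner above: re.search returns None when a string contains no separator, so calling .group() on it would raise AttributeError. A possible guard is sketched below; the helper name name_before_separator and the extra 'Zoe' entry are assumptions added for illustration.

```python
import re

SEPARATOR = re.compile(r'.*(?= (and|with|&))')

def name_before_separator(text):
    # Hypothetical helper: return the part before the separator,
    # or the unchanged text when no separator is present.
    match = SEPARATOR.search(text)
    return match.group() if match else text

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31', 'Zoe']
print([name_before_separator(item) for item in data])
# ['Alice Smith', 'Tim', 'Uncle Neo', 'Zoe']
```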

Upvotes: 0

Ajax1234

Reputation: 71451

You can try this:

import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
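# Raw string so that \s reaches the regex engine without an invalid-escape warning
final_data = [re.sub(r'\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
print(final_data)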

Output:

['Alice Smith', 'Tim', 'Uncle Neo']

Upvotes: 0

Jean-François Fabre

Reputation: 140168

The | needs grouping with parentheses in your attempt. Anyway, it's too complex.

I would just use re.sub to remove the separator word and everything after it:

data = [re.sub(r" (and|with|&) .*", "", d) for d in data]

result:

['Alice Smith', 'Tim', 'Uncle Neo']
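
The grouping point can be checked directly. In the sketch below (the variable names ungrouped and grouped are only illustrative), the question's pattern is parsed as three separate alternatives, "(.*) and", "with", and "&", and re.match tries each of them at the start of the string, so only the first element matches. Wrapping the alternation in a non-capturing group restricts it to the separator word.

```python
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

# Ungrouped: parsed as "(.*) and" OR "with" OR "&", each tried at the start.
ungrouped = re.compile(r'(.*) and|with|&')
# Grouped: the alternation applies only to the separator word.
grouped = re.compile(r'(.*) (?:and|with|&)')

for text in data:
    print(bool(ungrouped.match(text)), grouped.match(text).group(1))
# True Alice Smith
# False Tim
# False Uncle Neo
```

Because (.*) is greedy, the grouped pattern cuts at the last separator in a string that contains several.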

Upvotes: 0

Casimir et Hippolyte

Reputation: 89547

Use a list comprehension with re.split:

result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
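
A self-contained version of the line above might look like this. It is a sketch: the maxsplit=1 argument and the extra 'Zoe' entry are additions for illustration, not part of the original answer. A string with no separator comes back as a one-element list, so indexing [0] is safe without a None check.

```python
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31', 'Zoe']

# maxsplit=1 stops after the first separator, since only element [0] is used.
result = [re.split(r' (?:and|with|&) ', x, maxsplit=1)[0] for x in data]
print(result)  # ['Alice Smith', 'Tim', 'Uncle Neo', 'Zoe']
```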

Upvotes: 2
