whatisinaname

Reputation: 343

Split string via regular expression

Suppose I am given a string like:

input = """
[email protected] is a very nice person
[email protected] sucks
[email protected] is pretty funny."""

I have a regular expression for email addresses: ^[A-z0-9\+\.]+\@[A-z0-9\+\.]+\.[A-z0-9\+]+$

The goal is to split the string based on the email address regular expression. The output should be:

["is a very nice person", "sucks", "is pretty funny."]

I have been trying to use re.split(EMAIL_REGEX, input), but I haven't been successful: the output is a list whose only element is the entire string.

Upvotes: 1

Views: 290

Answers (4)

Exceen

Reputation: 765

I wouldn't use a regex here; plain string splitting works as well:

messages = []
for item in input.split('\n'):
    # drop everything before the first space, which here is just the email address
    item = ' '.join(item.split(' ')[1:])
    messages.append(item)

With this input:

input = """
[email protected] is a very nice person
[email protected] sucks
[email protected] is pretty funny."""

messages is:

['', 'is a very nice person', 'sucks', 'is pretty funny.']

If you want to remove the first element, just do it like this: messages = messages[1:]
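
A slightly tighter variant of the same idea, as a sketch only: str.partition splits on the first space, and blank lines are skipped so no empty first element appears. The addresses are made-up placeholders (the real ones are not visible in the question), and the name text is used instead of input to avoid shadowing the built-in.

```python
# Placeholder addresses; the real ones are not shown in the question.
text = """
alice@example.com is a very nice person
bob@example.org sucks
carol@example.net is pretty funny."""

messages = []
for line in text.splitlines():
    if not line.strip():
        continue  # skip the blank line produced by the leading newline
    # partition returns (before, separator, after) around the first space
    _, _, message = line.partition(' ')
    messages.append(message)

print(messages)
```

This still assumes every non-blank line starts with the address followed by a single space.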

Upvotes: 0

beroe

Reputation: 12326

You will probably need to loop through the lines and then split each with your regex. Also, your regex shouldn't have $ at the end.

Try something like:

import re

EMAIL_REGEX = r"\.[a-z]{3} "  # just for the demo; note the trailing space
ends = []
for L in input.split("\n"):
    parts = re.split(EMAIL_REGEX, L)
    if len(parts) > 1:
        ends.append(parts[1])

Output:

['is a very nice person', 'sucks', 'is pretty funny.']

Upvotes: 0

Santa

Reputation: 11545

The '^' and '$' anchors might be throwing it off. With them, the pattern only matches when the entire string is a single email address.

I have something close to what you want:

>>> import re
>>> EMAIL_REGEX = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> re.split(EMAIL_REGEX, input, flags=re.IGNORECASE)
['\n', ' is a very nice person\n', ' sucks\n', ' is pretty funny.']
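
To get from there to exactly the list in the question, one option is to strip each piece and drop the ones that end up empty. A minimal sketch, using made-up placeholder addresses since the real ones are not visible, and text rather than input as the variable name:

```python
import re

EMAIL_REGEX = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# Placeholder addresses; the real ones are not shown in the question.
text = """
alice@example.com is a very nice person
bob@example.org sucks
carol@example.net is pretty funny."""

parts = re.split(EMAIL_REGEX, text, flags=re.IGNORECASE)
# Strip surrounding whitespace/newlines and discard empty pieces.
result = [p.strip() for p in parts if p.strip()]
print(result)
```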

Upvotes: 1

Barmar

Reputation: 782409

Remove the ^ and $ anchors, as they only match the beginning and end of the string. Since the email addresses are in the middle of the string, they'll never match.
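
A small sketch of the difference, using the pattern from the question with and without the anchors. The address is a made-up placeholder, and the variable names are only for illustration:

```python
import re

# Placeholder address; the real ones are not shown in the question.
text = "alice@example.com is a very nice person"

anchored = r'^[A-z0-9\+\.]+\@[A-z0-9\+\.]+\.[A-z0-9\+]+$'
unanchored = r'[A-z0-9\+\.]+\@[A-z0-9\+\.]+\.[A-z0-9\+]+'

# With ^ and $ the whole string would have to be an address, so nothing
# matches and re.split returns the input unchanged in a one-element list.
print(re.split(anchored, text))

# Without the anchors the address is found and the string is split around it.
print(re.split(unanchored, text))
```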

Your regexp has other problems. The account name can contain many characters besides the ones you allow, e.g. _ and -. The domain name can contain - characters, but not +. And you shouldn't use the range A-z to get upper and lower case characters, because there are characters between the two alphabetic blocks that you probably don't want to include (see the ASCII Table); either use A-Za-z or use a-z and add flags=re.IGNORECASE.
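
To see the A-z issue concretely, here is a short sketch that builds the characters lying between 'Z' and 'a' in ASCII and checks which character class accepts them. The variable name between is just for illustration:

```python
import re

# The ASCII characters strictly between 'Z' and 'a'.
between = ''.join(chr(c) for c in range(ord('Z') + 1, ord('a')))
print(between)

# [A-z] accepts every one of these punctuation characters...
print(re.findall(r'[A-z]', between))

# ...while [A-Za-z] accepts none of them.
print(re.findall(r'[A-Za-z]', between))

# a-z combined with re.IGNORECASE covers both cases of letters.
print(re.findall(r'[a-z]+', 'AbC', flags=re.IGNORECASE))
```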

Upvotes: 4
