Reputation: 343
Suppose I am given a string like:
input = """
[email protected] is a very nice person
[email protected] sucks
[email protected] is pretty funny."""
I have a regular expression for email addresses: ^[A-z0-9\+\.]+\@[A-z0-9\+\.]+\.[A-z0-9\+]+$
The goal is to split the string based on the email address regular expression. The output should be:
["is a very nice person", "sucks", "is pretty funny."]
I have been trying to use re.split(EMAIL_REGEX, input), but I haven't been successful.
The output is a list whose only element is the entire string.
Upvotes: 1
Views: 290
Reputation: 765
I wouldn't use a regex here; plain string splitting works as well:
messages = []
for item in input.split('\n'):
    # Drop everything before the first space, which is just the email address in this case
    item = ' '.join(item.split(' ')[1:])
    messages.append(item)
Output of messages when using:
input = """
[email protected] is a very nice person
[email protected] sucks
[email protected] is pretty funny."""
['', 'is a very nice person', 'sucks', 'is pretty funny.']
The empty first element comes from the blank line at the start of input. If you want to remove it, just do it like this: messages = messages[1:]
Upvotes: 0
Reputation: 12326
You will probably need to loop through the lines and then split each with your regex.
Also, your regex shouldn't have $ at the end.
Try something like:
import re
EMAIL_REGEX = r"\.[a-z]{3} "  # just for the demo; note the trailing space
ends = []
for L in input.split("\n"):
    parts = re.split(EMAIL_REGEX, L)
    if len(parts) > 1:
        ends.append(parts[1])
Output:
['is a very nice person', 'sucks', 'is pretty funny.']
Upvotes: 0
Reputation: 11545
The ^ and $ anchors might be throwing it off. With them, the pattern only matches a string that starts and ends with the matched text, i.e. a string that is nothing but an email address.
I have something close to what you want:
>>> EMAIL_REGEX = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> re.split(EMAIL_REGEX, input, flags=re.IGNORECASE)
['\n', ' is a very nice person\n', ' sucks\n', ' is pretty funny.']
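The pieces above still carry leading spaces and newlines, plus a leftover '\n' element. A minimal sketch of one way to tidy them up, assuming placeholder addresses (alice@example.com and so on are made up here, since the real ones are redacted in the question) and a variable named text instead of input to avoid shadowing the builtin:

```python
import re

# Placeholder addresses (assumption); the originals are redacted in the question.
text = """
alice@example.com is a very nice person
bob@example.org sucks
carol@example.net is pretty funny."""

EMAIL_REGEX = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

parts = re.split(EMAIL_REGEX, text, flags=re.IGNORECASE)

# Strip surrounding whitespace and drop pieces that are empty afterwards.
messages = [p.strip() for p in parts if p.strip()]
print(messages)
```

This should give the list asked for in the question, without the whitespace-only first element.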
Upvotes: 1
Reputation: 782409
Remove the ^ and $ anchors, as they only match the beginning and end of the string. Since the email addresses are in the middle of the string, the anchored pattern will never match.
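To see the effect of the anchors in isolation, here is a rough sketch that runs the question's pattern with and without them. The addresses are placeholders I made up (the real ones are redacted), and text stands in for input:

```python
import re

# Placeholder addresses (assumption); the originals are redacted in the question.
text = """
alice@example.com is a very nice person
bob@example.org sucks"""

# The pattern from the question, once with and once without the anchors.
anchored = r'^[A-z0-9\+\.]+\@[A-z0-9\+\.]+\.[A-z0-9\+]+$'
unanchored = r'[A-z0-9\+\.]+\@[A-z0-9\+\.]+\.[A-z0-9\+]+'

# With anchors nothing matches, so re.split returns the whole string as one element.
anchored_parts = re.split(anchored, text)

# Without anchors the string is split at each address.
unanchored_parts = re.split(unanchored, text)

print(anchored_parts)
print(unanchored_parts)
```

The first result is the "entire string contained in the list" behaviour described in the question.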
Your regexp has other problems. The account name can contain many other characters than the ones you allow, e.g. _ (which your pattern matches only by accident of the A-z range) and -. The domain name can contain - characters, but not +. And you shouldn't use the range A-z to get upper and lower case characters, because there are characters between the two alphabetic blocks that you probably don't want to include (see the ASCII table); either use A-Za-z, or use a-z and add flags=re.IGNORECASE.
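As a quick check of the A-z point, this small sketch lists the ASCII characters that sit between 'Z' and 'a' and tests them against each character class. The variable names are just for illustration:

```python
import re

# Characters whose code points fall strictly between 'Z' (90) and 'a' (97).
between = [chr(c) for c in range(ord('Z') + 1, ord('a'))]
print(between)

# [A-z] accepts every one of them; [A-Za-z] accepts none.
loose = [ch for ch in between if re.fullmatch(r'[A-z]', ch)]
strict = [ch for ch in between if re.fullmatch(r'[A-Za-z]', ch)]
print(loose, strict)

# The alternative: a lowercase-only range plus the IGNORECASE flag.
case_insensitive = re.fullmatch(r'[a-z]+', 'MiXeD', flags=re.IGNORECASE)
print(case_insensitive is not None)
```

The extra characters are the square brackets, backslash, caret, underscore and backtick, so the A-z range quietly lets all of those into an "alphabetic" class.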
Upvotes: 4