Reputation: 1055
My script currently works as expected: it splits a string on URLs, keeps the surrounding text, and puts everything in a list:
import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.split(r'(https?://\S+)', s)
print(result)
Output:
['This is my tweet check it out ', 'http://www.example.com/blah', ' and ', 'http://blabla.com', '']
Now I'm stuck on another problem: sometimes I get URLs as HTML, or as mixed text and HTML, and the URLs look like this:
<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>
Here href holds the full URL, and the value between <a>...</a> is the shortened URL.
So I can receive a string like this to manipulate:
s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'
I'd like to keep the same logic in my function, but if I use:
result = re.split(r'(https?://\S+)', s)
print(result)
like before, I get this (WRONG):
['This is an html link: <a href="', 'http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']
But I'd like to get a result like this (if the link is HTML, keep the whole <a> tag as one item):
Output expected:
['This is an html link: ', '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']
Upvotes: 1
Views: 875
Reputation: 3518
Try:
s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'
result = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)
print(result)
The key is the addition of (?:<a href=")?. The (?:...) syntax means a group that isn't captured; it's useful here so that the ? applies to that entire unit instead of a single character.
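To see what the optional group changes, here is a small sketch comparing the split with and without it. It reuses the sample string from the question; the variable names plain and tagged are only for illustration.

```python
import re

s = ('This is an html link: '
     '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>'
     ' and this is a text url: http://blabla.com')

# Without the optional prefix the match starts at "http",
# so the '<a href="' part is left behind in the preceding text.
plain = re.split(r'(https?://\S+)', s)

# With (?:<a href=")? the match may start at the tag itself,
# so the whole anchor stays together as one list item.
tagged = re.split(r'((?:<a href=")?https?://\S+)', s)

print(repr(plain[0]))   # 'This is an html link: <a href="'
print(repr(tagged[0]))  # 'This is an html link: '
print(tagged[1])        # the complete <a ...>...</a> element
```

This only holds together because the sample anchor contains no whitespace after href=; link text with spaces would end the \S+ match early.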
Note: a URL at the beginning or end creates a blank list item. If you'd like to remove those, try:
result = list(filter(None, result))
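As a quick illustration of that cleanup, the sketch below uses a made-up string with a URL at the very start, which produces an empty first item. Passing None as the first argument to filter drops every falsy item, which here means the empty strings.

```python
import re

# Hypothetical input: the URL sits at the start of the string.
s = 'http://blabla.com is a text url'

parts = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)
print(parts)    # ['', 'http://blabla.com', ' is a text url']

# filter(None, ...) keeps only truthy items, so '' is removed.
cleaned = list(filter(None, parts))
print(cleaned)  # ['http://blabla.com', ' is a text url']
```

A list comprehension such as [p for p in parts if p] would do the same thing, if that reads more clearly to you.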
EDIT: Added [^\s,.:;] to the end of the match. The ^ negates the character class, so the final character of the match cannot be whitespace or any of the listed punctuation marks. This prevents links from gobbling up punctuation directly after them, like commas.
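A short sketch of that effect, using an invented sentence in which each URL is followed directly by punctuation:

```python
import re

pattern = r'((?:<a href=")?https?://\S+[^\s,.:;])'

# Hypothetical input: a comma after the first URL, a period after the second.
s = 'See http://blabla.com, then http://www.example.com/blah.'

parts = re.split(pattern, s)
print(parts)
# ['See ', 'http://blabla.com', ', then ', 'http://www.example.com/blah', '.']
```

The regex engine backtracks until the last matched character is outside the listed set, so the comma and the period stay in the surrounding text rather than in the URL. The trade-off is that a URL which legitimately ends in one of those characters would lose it.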
Upvotes: 1