Python split string by urls with and without a href

Question

My script currently works as expected: it splits a string by URL, keeps the surrounding text, and puts everything in a list:

import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.split(r'(https?://\S+)', s)
print(result)

Output:

['This is my tweet check it out ', 'http://www.example.com/blah', ' and ', 'http://blabla.com', '']
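
For readers wondering why the URLs survive the split at all: the parentheses in the pattern form a capturing group, and `re.split` includes the text matched by capturing groups in its result. A small comparison, reusing the string from the question (the variable names `without_group` and `with_group` are just illustrative):

```python
import re

s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'

# Without a capturing group, the URLs act purely as separators and are dropped.
without_group = re.split(r'https?://\S+', s)

# With a capturing group, re.split also returns the text the group matched.
with_group = re.split(r'(https?://\S+)', s)

print(without_group)
print(with_group)
```

The trailing empty string appears in both cases because the string ends with a match.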

Now I'm stuck on another problem: sometimes I get URLs as HTML, or as mixed text and HTML, and the URLs look like this:

https://shorted.com/FJAKS

The href holds the full URL, and the value between the opening and closing a tags is the shortened URL.

So I can receive a string like this to manipulate:

s = 'This is an html link: https://shorted.com/FJAKS and this is a text url: http://blabla.com'

I'd like to keep the same logic for my function, but if I use:

result = re.split(r'(https?://\S+)', s)
print(result)

like before, I get this (WRONG):

['This is an html link: https://shorted.com/FJAKS', ' and this is a text url: ', 'http://blabla.com', '']

But I'd like to get a result like this (if it is HTML, keep the whole a tag):

Expected output:

['This is an html link: ', 'https://shorted.com/FJAKS', ' and this is a text url: ', 'http://blabla.com', '']

CrazyChucky · Accepted Answer

Try:

s = 'This is an html link: https://shorted.com/FJAKS and this is a text url: http://blabla.com'
result = re.split(r'((?:

(?:...) is a non-capturing group: it groups part of the pattern without capturing it, which is useful so that the ? applies to that entire unit instead of a single character.
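
As a minimal sketch of the idea, assume the HTML links have the form `<a href="FULL_URL">SHORT_URL</a>` (the href value below is a made-up placeholder, since the real one isn't shown). A pattern that tries a whole a tag first and falls back to a bare URL could look like the following. This is an illustrative alternation-based pattern, not necessarily the exact regex used in this answer, and a real HTML parser is the safer choice for anything beyond simple, well-formed tags.

```python
import re

# Hypothetical input: the href value is a placeholder for illustration only.
s = ('This is an html link: '
     '<a href="https://www.example.com/full-url">https://shorted.com/FJAKS</a>'
     ' and this is a text url: http://blabla.com')

# One capturing group with two alternatives:
#   <a\s[^>]*>.*?</a>  a complete anchor tag, matched lazily up to the first </a>
#   https?://\S+       a bare URL
# The tag alternative is listed first, and the tag starts before the URL inside
# its href, so the whole tag is consumed before the bare-URL branch can fire.
pattern = r'(<a\s[^>]*>.*?</a>|https?://\S+)'

result = re.split(pattern, s)
print(result)
```

Because the pattern still has exactly one capturing group, the result keeps the same shape as before: plain text and links alternate in the list.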

Note: a URL at the very beginning or end of the string produces an empty string in the list. If you'd like to remove those, try:

result = list(filter(None, result))
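
A short illustration of that cleanup, using a made-up string that starts with a URL (the names `parts` and `cleaned` are only for this example). Passing `None` as the function makes `filter` keep only truthy items, so empty strings are dropped:

```python
import re

# Hypothetical input that begins with a URL, so the split yields a leading ''.
parts = re.split(r'(https?://\S+)', 'http://blabla.com is my site')
print(parts)

# filter(None, ...) keeps only truthy items, which removes empty strings.
cleaned = list(filter(None, parts))
print(cleaned)
```

Items containing only whitespace, such as `' '`, are truthy and are kept.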


EDIT: Added [^\s,.:;] to the end of the match. The ^ negates the character class, so the final character of the match can't be whitespace or any of the listed punctuation marks. This keeps links from gobbling up punctuation directly after them, like commas.
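
To see the effect of that trailing character class in isolation, here is a small sketch on bare URLs only. The sample sentence is invented for the example, and `naive` and `trimmed` are illustrative names:

```python
import re

# Hypothetical sentence with punctuation directly after each URL.
s = 'See http://blabla.com, then http://www.example.com/blah.'

# Without the trailing class, \S+ swallows the comma and the full stop.
naive = re.split(r'(https?://\S+)', s)

# With [^\s,.:;] at the end, the regex backtracks until the match ends
# on a character that is not whitespace or one of , . : ;
trimmed = re.split(r'(https?://\S+[^\s,.:;])', s)

print(naive)
print(trimmed)
```

One side effect of this design: because `\S+` needs at least one character and the class needs another, at least two characters must follow `://` for a match.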
