Giuseppe Lodi Rizzini

Reputation: 1055

Python split string by urls with and without a href

Currently my script works as expected: it splits a string by URL, keeps the surrounding text, and puts everything in a list:

import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.split(r'(https?://\S+)', s)
print(result)

Output:

['This is my tweet check it out ', 'http://www.example.com/blah', ' and ', 'http://blabla.com', '']

Now I'm stuck on another problem: sometimes I get URLs as HTML, or as mixed text and HTML, and the URLs look like this:

<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>

The href holds the full URL, and the text between <a> and </a> is the shortened URL.

So I can receive a string like this to manipulate:

s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'

I'd like to keep the same logic in my function, but if I use:

result = re.split(r'(https?://\S+)', s)
print(result)

like before, I get this (WRONG):

['This is an html link: <a href="', 'http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']

But I'd like to get a result like this (if it is HTML, keep the whole <a> tag as one item):

Expected output:

['This is an html link: ', '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']

Upvotes: 1

Views: 875

Answers (1)

CrazyChucky

Reputation: 3518

Try:

import re
s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'
result = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)
print(result)

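The key is the addition of (?:<a href=")?. The (?:...) syntax creates a group that isn't captured; it's useful here so that the trailing ? makes that entire unit optional instead of a single character. Because \S+ then runs up to the next whitespace, the match extends through the closing </a>, as long as the rest of the tag contains no spaces.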
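To illustrate why the inner group should be non-capturing, here is a small comparison (a sketch; the variable names are mine). re.split includes the text of every capturing group in its result, so a capturing inner group would inject extra items into the list:

```python
import re

s = ('This is an html link: '
     '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>'
     ' and this is a text url: http://blabla.com')

# Non-capturing inner group: only the outer group's text lands in the list.
non_capturing = re.split(r'((?:<a href=")?https?://\S+)', s)

# Capturing inner group: re.split also emits the inner group for every match
# (the matched prefix, or None when the optional group did not participate).
capturing = re.split(r'((<a href=")?https?://\S+)', s)

print(non_capturing)
print(capturing)
```

The second list contains the stray items '<a href="' and None, which is usually not what you want when splitting.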

Note: a URL at the beginning or end of the string creates an empty string in the list. If you'd like to remove those, try:

result = list(filter(None, result))
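Passing None as the first argument to filter drops every falsy item, which for a list of strings means the empty ones. A minimal sketch with a made-up input string, alongside an equivalent list comprehension:

```python
import re

s = 'http://blabla.com is a URL at the very start'
parts = re.split(r'(https?://\S+)', s)
# parts begins with '' because the first match starts at index 0.

cleaned = list(filter(None, parts))   # drops '' (and any other falsy item)
same = [p for p in parts if p]        # equivalent list comprehension

print(parts)
print(cleaned)
```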

EDIT: Added [^\s,.:;] to the end of the pattern. The leading ^ negates the character class, so the final character of the match must be something other than whitespace or one of the listed punctuation marks. This keeps links from gobbling up punctuation directly after them, like commas.
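A quick check of the trailing-punctuation behaviour, using a made-up sentence (a sketch, not an exhaustive test of URL edge cases):

```python
import re

s = 'See http://blabla.com, then http://example.com/a.'

# Without the final character class, the comma and period stick to the URLs.
greedy = re.split(r'(https?://\S+)', s)

# With it, the regex backtracks one character so the punctuation stays in the text.
trimmed = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)

print(greedy)
print(trimmed)
```

Note that only a single trailing punctuation character is guaranteed to be left out of the match; a URL followed by something like '...' would still keep some of the dots.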

Upvotes: 1
