MaxCore

Reputation: 2738

Python Regex: Replace all urls in string with <img> and <a> tags

I have a string containing many URLs, some pointing to pages and some to images:

La-la-la https://example.com/ la-la-la https://example.com/example.PNG

And I need to convert it to:

La-la-la <a href="https://example.com/">https://example.com/</a> la-la-la <img src="https://example.com/example.PNG">

Image formats are unpredictable (they can be .png, .JPEG, etc.), and any link can appear multiple times per string.

I know there are some odd JavaScript examples on this site, but I can't work out how to convert them to Python.

But I found this as a starting point:

url_regex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig
img_regex = /^ftp|http|https?:\/\/(?:[a-z\-]+\.)+[a-z]{2,6}(?:\/[^\/#?]+)+\.(?:jpe?g|gif|png)$/ig

Many thanks for any help.

Upvotes: 2

Views: 619

Answers (2)

Paolo

Reputation: 26094

You may use the following regular expression:

(https?.*?\.com\/)(\s+[\w-]*\s+)(https?.*?\.com\/[\w\.]+)

  • (https?.*?\.com\/) First capture group. Capture http or https, anything up to .com and forward slash /.
  • (\s+[\w-]*\s+) Second capture group. Capture whitespace, then word characters and hyphens, then whitespace. You can add more characters to the character set if needed.
  • (https?.*?\.com\/[\w\.]+) Third capture group. Capture http or https, anything up to .com, a forward slash /, then word characters and full stops . for the file name and extension. Again, you can add more characters to this character set if you expect other characters.
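
To see what each group captures, the pattern can be run with re.search against the sample string from the question. This is only a small sketch for inspection; the variable names are illustrative.

```python
import re

# Pattern from the answer; sample string taken from the question.
pattern = r'(https?.*?\.com\/)(\s+[\w-]*\s+)(https?.*?\.com\/[\w\.]+)'
text = "La-la-la https://example.com/ la-la-la https://example.com/example.PNG"

match = re.search(pattern, text)
groups = match.groups()

# Print each capture group with its number; repr() makes the
# surrounding whitespace in group 2 visible.
for number, value in enumerate(groups, start=1):
    print(number, repr(value))
```

Group 2 keeps its leading and trailing spaces, which is why the replacement string can put \2 straight between the two tags.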

You can test the regex live here.

Alternatively, if you are expecting varying URLs and domains, you may use:

(\w*\:.*?\.\w*\/)(\s+[\w-]*\s+)(\w*\:?.*?\.\w*\/[\w\.]+)

Here the first and third capture groups match any word characters followed by a colon : (optional in the third group), then anything up to a full stop ., more word characters \w and a forward slash /. You can test this here.

You may replace captured groups with:

<a href="\1">\1</a>\2<img src="\3">

Where \1, \2, and \3 are backreferences to captured groups one, two and three respectively.


Python snippet:

>>> import re
>>> # named "text" rather than "str" to avoid shadowing the built-in type
>>> text = "La-la-la https://example.com/ la-la-la https://example.com/example.PNG"
>>> out = re.sub(r'(https?.*?\.com\/)(\s+[\w-]*\s+)(https?.*?\.com\/[\w\.]+)',
...              r'<a href="\1">\1</a>\2<img src="\3">',
...              text)
>>> print(out)
La-la-la <a href="https://example.com/">https://example.com/</a> la-la-la <img src="https://example.com/example.PNG">
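
The pattern above expects exactly one page link followed by one image link. Since the question mentions links appearing any number of times and extensions in any case, a more general sketch is to match each URL on its own and pass a function to re.sub that chooses the tag. The URL pattern (https?:// followed by non-whitespace), the extension list, and the names to_tag and linkify are assumptions for illustration, not something taken from the question.

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-whitespace.
URL_RE = re.compile(r'https?://\S+')
# Assumption: these extensions count as images; extend as needed.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif')

def to_tag(match):
    """Return an <img> tag for image URLs and an <a> tag for everything else."""
    url = match.group(0)
    # Compare in lower case so .PNG and .JPEG are treated as images too.
    if url.lower().endswith(IMAGE_EXTENSIONS):
        return '<img src="{}">'.format(url)
    return '<a href="{0}">{0}</a>'.format(url)

def linkify(text):
    # re.sub calls to_tag once for every URL found in the string.
    return URL_RE.sub(to_tag, text)

print(linkify("La-la-la https://example.com/ la-la-la https://example.com/example.PNG"))
```

Note that this sketch treats any trailing punctuation as part of the URL and does not HTML-escape anything, so it would need tightening before being used on untrusted input.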

Upvotes: 1

Druta Ruslan

Reputation: 7412

You can do this without regex, if you want.

text = 'La-la-la https://example.com/ la-la-la https://example.com/example.PNG'

template = '{f_txt} <a href="{f_url}">{f_url}</a> {s_txt} <img src="{s_url}">'

# split() must yield exactly four tokens here: text, URL, text, image URL
f_txt, f_url, s_txt, s_url = text.split()

print(template.format(f_txt=f_txt, f_url=f_url, s_txt=s_txt, s_url=s_url))

Output

La-la-la <a href="https://example.com/">https://example.com/</a> la-la-la <img src="https://example.com/example.PNG"> 
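
The same regex-free idea can be stretched to strings with any number of links by looping over the whitespace-separated tokens instead of unpacking exactly four. This is a rough sketch; the extension list and the name linkify_tokens are assumptions for illustration.

```python
# Assumption: these extensions count as images; extend as needed.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif')

def linkify_tokens(text):
    """Wrap every whitespace-separated URL in an <a> or <img> tag."""
    words = []
    for token in text.split():
        if token.startswith(('http://', 'https://')):
            # Lower-case comparison so .PNG and .JPEG also count as images.
            if token.lower().endswith(IMAGE_EXTENSIONS):
                token = '<img src="{}">'.format(token)
            else:
                token = '<a href="{0}">{0}</a>'.format(token)
        words.append(token)
    # Joining with single spaces collapses any original runs of whitespace.
    return ' '.join(words)

print(linkify_tokens('La-la-la https://example.com/ la-la-la https://example.com/example.PNG'))
```

Because the string is rebuilt with single spaces, tabs, newlines and repeated spaces from the input are not preserved.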

Upvotes: 1
