Detect URL and add anchor tags using BeautifulSoup

Question

I am parsing html using BeautifulSoup. Given following HTML:



    
        An absolute URL: https://www.w3schools.com

I wish to convert it to:



    
        An absolute URL: https://www.w3schools.com

Code written so far:

def detect_urls_and_update_target(self, root): //root is the soup object
        for tag in root.find_all(True):
            if tag.name == 'a':
                if not tag.has_attr('target'):
                    tag.attrs['target'] = '_blank'
            elif tag.string is not None:
                  for url in re.findall(self.url_regex, tag.string): //regex which detects URLS which works
                      new_tag = root.new_tag("a", href=url, target="_blank")
                      new_tag.string = url
                      tag.append(new_tag)

This adds the required anchor tag but I am not able to figure out how do I remove the original URL from the tag.

Martin Evans · Accepted Answer

You could use BeautifulSoup to reconstruct the parent contents as follows:

from bs4 import BeautifulSoup
import re

html = """

    
        An absolute URL: https://www.w3schools.com
        Another link: https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22
        some div
Hello world from https://www.google.com
    
"""

soup = BeautifulSoup(html, "html.parser")
re_url = re.compile(r'(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)')

for tag in soup.find_all(text=True):
    tags = []
    url = False

    for t in re_url.split(tag.string):
        if re_url.match(t):
            a = soup.new_tag("a", href=t, target='_blank')
            a.string = t
            tags.append(a)
            url = True
        else:
            tags.append(t)

    if url:
        for t in tags:
            tag.insert_before(t)
        tag.extract()

print(soup)    
print()

This would display the following output:





An absolute URL: https://www.w3schools.com
Another link: https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22
some div
Hello world from https://www.google.com

This works by first splitting any tags containing text using a regular expression to find any URL. For each entry, if it is a URL, replace it in the list with a new anchor tag. If no URLs are found, leave the tag alone. Next insert each list of updated tags before the existing tag and then remove the existing tag.

To skip over any URLs in the DOCTYPE, the find_all() could be altered as follows:

from bs4 import BeautifulSoup, Doctype

...

for tag in soup.find_all(string=lambda text: not isinstance(text, Doctype)):

Detect URL and add anchor tags using BeautifulSoup

Answers (2)

Related Questions