Vivek Kothari

Reputation: 472

Detect URL and add anchor tags using BeautifulSoup

I am parsing HTML using BeautifulSoup. Given the following HTML:

<!DOCTYPE html>
<html>
    <body>
        <p>An absolute URL: https://www.w3schools.com</p>
    </body>
</html>

I wish to convert it to:

<!DOCTYPE html>
<html>
    <body>
        <p>An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a></p>
    </body>
</html>

Code written so far:

def detect_urls_and_update_target(self, root):  # root is the soup object
    for tag in root.find_all(True):
        if tag.name == 'a':
            if not tag.has_attr('target'):
                tag.attrs['target'] = '_blank'
        elif tag.string is not None:
            # self.url_regex is a regex that detects URLs; this part works
            for url in re.findall(self.url_regex, tag.string):
                new_tag = root.new_tag("a", href=url, target="_blank")
                new_tag.string = url
                tag.append(new_tag)

This adds the required anchor tag, but I cannot figure out how to remove the original URL text from the tag.

Upvotes: 2

Views: 1334

Answers (2)

Martin Evans

Reputation: 46759

You could use BeautifulSoup to reconstruct the parent contents as follows:

from bs4 import BeautifulSoup
import re

html = """<!DOCTYPE html>
<html>
    <body>
        <p>An absolute URL: https://www.w3schools.com</p>
        <p>Another link: https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22</p>
        <div><div>some div</div>Hello world from https://www.google.com</div>
    </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
re_url = re.compile(r'(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)')

for tag in soup.find_all(string=True):  # "text" is the older, deprecated name for "string"
    tags = []
    url = False

    for t in re_url.split(tag.string):
        if re_url.match(t):
            a = soup.new_tag("a", href=t, target='_blank')
            a.string = t
            tags.append(a)
            url = True
        else:
            tags.append(t)

    if url:
        for t in tags:
            tag.insert_before(t)
        tag.extract()

print(soup)
print()

This would display the following output:

<!DOCTYPE html>

<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a></p>
<p>Another link: <a href="https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22" target="_blank">https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22</a></p>
<div><div>some div</div>Hello world from <a href="https://www.google.com" target="_blank">https://www.google.com</a></div>
</body>
</html>

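This works by splitting each text node with a regular expression that matches URLs. Because the pattern is wrapped in a capturing group, re.split() keeps the matched URLs in the resulting list. Each piece that is a URL is replaced in the list by a new anchor tag; every other piece is kept as plain text. If a text node contains no URLs, it is left alone. Otherwise, each item in the list is inserted before the original text node, and the original text node is then removed.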
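As a small illustration of the splitting step, the sketch below shows how re.split() behaves when the pattern contains a capturing group. The pattern and the sample text are simplified assumptions made for this example; they are not the regular expression used in the answer.

```python
import re

# Simplified, illustrative URL pattern (an assumption, not the answer's regex).
# The outer parentheses form a capturing group, so split() keeps the matches.
url_pattern = re.compile(r'(https?://[^\s<>"]+)')

text = "See https://www.w3schools.com and https://www.google.com now"
parts = url_pattern.split(text)

# Plain text lands on even indices, captured URLs on odd indices.
plain_pieces = parts[0::2]
url_pieces = parts[1::2]

print(plain_pieces)
print(url_pieces)
```

When the text ends with a URL, the last element of the list is an empty string, so code that rebuilds the node may want to skip empty pieces.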


To skip over any URLs in the DOCTYPE, the find_all() could be altered as follows:

from bs4 import BeautifulSoup, Doctype

...

for tag in soup.find_all(string=lambda text: not isinstance(text, Doctype)):
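The same filtering idea can be extended to other text that should not be linked, such as text already inside an anchor. The following is only a sketch under stated assumptions: the function name linkify, the simplified URL_RE pattern, and the SKIP_PARENTS set are hypothetical names introduced for this example and do not come from the answer above.

```python
import re
from bs4 import BeautifulSoup, Comment, Doctype

# Simplified, illustrative pattern (assumption); swap in your own URL regex.
URL_RE = re.compile(r'(https?://[^\s<>"]+)')
# Hypothetical skip list: text inside these tags is left untouched.
SKIP_PARENTS = {"a", "script", "style"}

def linkify(html):
    """Return html with bare URLs in text nodes wrapped in <a target="_blank">."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        # Leave the DOCTYPE, comments, and already-linked text alone.
        if isinstance(node, (Doctype, Comment)) or node.parent.name in SKIP_PARENTS:
            continue
        parts = URL_RE.split(str(node))
        if len(parts) == 1:  # no URL in this text node
            continue
        for i, part in enumerate(parts):
            if i % 2:  # odd indices hold the captured URLs
                a = soup.new_tag("a", href=part, target="_blank")
                a.string = part
                node.insert_before(a)
            elif part:  # skip the empty strings split() can produce
                node.insert_before(part)
        node.extract()  # remove the original, now duplicated, text node
    return str(soup)

sample = '<p>See https://www.w3schools.com</p><a href="https://x.org">https://x.org</a>'
result = linkify(sample)
print(result)
```

Checking node.parent.name avoids nesting a new anchor inside an existing one, which would otherwise happen when the visible text of a link is itself a URL.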

Upvotes: 2

Ajax1234

Reputation: 71451

You can use re.sub with a decorator to wrap every URL found in the body of the given tags in an anchor tag built from the provided parameters:

import re

def format_hrefs(tags=['p'], _target='_blank', a_class=''):
    def outer(f):
        def format_url(text):
            # Replace the URL with a placeholder, then fill in the anchor tag
            _start = re.sub(r'https*://www\.[\w\W]+\.\w{3}', '{}', text)
            return _start.format(*['<a href="{}" target="{}" class="{}">{}</a>'.format(i, _target, a_class, i) for i in re.findall(r'https*://www\.\w+\.\w{3}', text)])
        def wrapper():
            html = f()  # the decorated function returns the HTML to process
            # Pattern matching the body of any of the given tags
            body = '|'.join(r'(?<=\<' + i + r'\>)[\w\W]+(?=\</' + i + r'\>)' for i in tags)
            _format = re.sub(body, '{}', html)
            _text = re.findall(body, html)
            return _format.format(*[format_url(i) for i in _text])
        return wrapper
    return outer

@format_hrefs()
def get_html():
   content = """
     <!DOCTYPE html>
     <html>
      <body>
        <p>An absolute URL: https://www.w3schools.com</p>
      </body>
     </html>
    """
   return content

print(get_html())

Output:

<!DOCTYPE html>
 <html>
  <body>
   <p>An absolute URL: <a href="https://www.w3schools.com" target="_blank" class="">https://www.w3schools.com</a></p>
  </body>
</html>

Upvotes: 1
