Reputation: 472
I am parsing HTML using BeautifulSoup. Given the following HTML:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: https://www.w3schools.com</p>
</body>
</html>
I wish to convert it to:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a></p>
</body>
</html>
Code written so far:
def detect_urls_and_update_target(self, root):  # root is the soup object
    for tag in root.find_all(True):
        if tag.name == 'a':
            if not tag.has_attr('target'):
                tag.attrs['target'] = '_blank'
        elif tag.string is not None:
            # self.url_regex is a regex that detects URLs (this part works)
            for url in re.findall(self.url_regex, tag.string):
                new_tag = root.new_tag("a", href=url, target="_blank")
                new_tag.string = url
                tag.append(new_tag)
This adds the required anchor tag, but I cannot figure out how to remove the original URL from the tag's text.
Upvotes: 2
Views: 1334
Reputation: 46759
You could use BeautifulSoup to reconstruct the parent contents as follows:
from bs4 import BeautifulSoup
import re
html = """<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: https://www.w3schools.com</p>
<p>Another link: https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22</p>
<div><div>some div</div>Hello world from https://www.google.com</div>
</body>
</html>"""
soup = BeautifulSoup(html, "html.parser")
re_url = re.compile(r'(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)')
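for tag in soup.find_all(text=True):
    tags = []
    url = False
    # The capturing group in re_url keeps the URLs in the split result
    for t in re_url.split(tag.string):
        if re_url.match(t):
            a = soup.new_tag("a", href=t, target='_blank')
            a.string = t
            tags.append(a)
            url = True
        else:
            tags.append(t)
    if url:
        # Rebuild the content in front of the old text node, then drop the node
        for t in tags:
            tag.insert_before(t)
        tag.extract()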
print(soup)
print()
This would display the following output:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a></p>
<p>Another link: <a href="https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22" target="_blank">https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22</a></p>
<div><div>some div</div>Hello world from <a href="https://www.google.com" target="_blank">https://www.google.com</a></div>
</body>
</html>
This works by splitting the string of each text node with a regular expression whose capturing group keeps the URLs in the result. Each piece that is a URL is replaced in the list with a new anchor tag, and the other pieces are kept as plain text. If no URLs are found, the text node is left alone. Otherwise, each item in the list is inserted before the original text node, which is then removed.
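The splitting step relies on re.split() keeping captured delimiters in its result. Here is a minimal sketch of just that step using only the standard library. The pattern url_re is a simplified stand-in chosen for illustration, not the regex from the code above.

```python
import re

# Simplified URL pattern for illustration only (an assumption, not the
# regex used in the answer). The parentheses form a capturing group.
url_re = re.compile(r'(https?://\S+)')

text = "Hello world from https://www.google.com today"

# Because the pattern has a capturing group, the matched URLs are kept
# in the result list, in between the surrounding text pieces.
parts = url_re.split(text)
print(parts)
# ['Hello world from ', 'https://www.google.com', ' today']

# Classify each piece the same way the loop in the answer does.
kinds = ["url" if url_re.fullmatch(p) else "text" for p in parts]
print(kinds)
# ['text', 'url', 'text']
```

Without the capturing group, re.split() would discard the URLs and return only the surrounding text.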
To skip over any URLs in the DOCTYPE, the find_all() call could be altered as follows:
from bs4 import BeautifulSoup, Doctype
...
for tag in soup.find_all(string=lambda text: not isinstance(text, Doctype)):
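For context, here is a rough sketch of how that filter might fit into a reusable function. The function name linkify, the simplified URL_RE pattern, and the check that skips text already inside an a tag are all assumptions added for illustration and are not part of the answer above.

```python
import re
from bs4 import BeautifulSoup, Doctype

# Simplified URL pattern (an assumption for this sketch); the capturing
# group makes re.split() keep the URLs in its result.
URL_RE = re.compile(r'(https?://[^\s<>"]+)')

def linkify(html):
    """Hypothetical helper: wrap bare URLs in text nodes with <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # Select every text node except the DOCTYPE declaration.
    for node in soup.find_all(string=lambda s: not isinstance(s, Doctype)):
        # Extra assumption: leave text that is already inside a link alone.
        if node.find_parent("a") is not None:
            continue
        parts = URL_RE.split(str(node))
        if len(parts) == 1:
            continue  # no URL in this text node
        for i, part in enumerate(parts):
            if not part:
                continue  # skip empty pieces at the start or end
            if i % 2 == 1:  # odd positions hold the captured URLs
                a = soup.new_tag("a", href=part, target="_blank")
                a.string = part
                node.insert_before(a)
            else:
                node.insert_before(part)
        node.extract()  # remove the original text node
    return str(soup)

print(linkify('<p>See https://example.com now</p>'))
```

With a DOCTYPE that contains a URL, such as an XHTML one, the declaration should pass through unchanged while URLs in the body are still wrapped.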
Upvotes: 2
Reputation: 71451
You can use re.sub with a decorator to wrap any occurrence of a URL in a tag body, using the provided parameters:
import re

def format_hrefs(tags=('p',), _target='_blank', a_class=''):
    def outer(f):
        def format_url(url):
            # Swap the URL for a '{}' placeholder, then fill it with the anchor markup
            _start = re.sub(r'https*://www\.[\w\W]+\.\w{3}', '{}', url)
            return _start.format(*['<a href="{}" target="{}" class="{}">{}</a>'.format(i, _target, a_class, i) for i in re.findall(r'https*://www\.\w+\.\w{3}', url)])
        def wrapper():
            html = f()
            # Matches the body of every tag listed in `tags`
            pattern = '|'.join(r'(?<=\<' + i + r'\>)[\w\W]+(?=\</' + i + r'\>)' for i in tags)
            _format = re.sub(pattern, '{}', html)
            _text = re.findall(pattern, html)
            return _format.format(*[format_url(i) for i in _text])
        return wrapper
    return outer

@format_hrefs()
def get_html():
    content = """
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: https://www.w3schools.com</p>
</body>
</html>
"""
    return content

print(get_html())
Output:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank" class="">https://www.w3schools.com</a></p>
</body>
</html>
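If the input is plain text rather than a full HTML document, the same re.sub idea can be sketched more directly with a backreference in the replacement string. The names URL_RE and linkify_text below are hypothetical, and the pattern is a simplified assumption. This sketch does not look at HTML structure, so it would also rewrite URLs that already sit inside attributes or existing links.

```python
import re

# Simplified URL pattern, assumed for illustration only.
URL_RE = re.compile(r'https?://[^\s<>"]+')

def linkify_text(text, target="_blank"):
    """Hypothetical helper: wrap each URL in plain text with an <a> tag."""
    # \g<0> refers to the whole match, i.e. the URL itself.
    replacement = r'<a href="\g<0>" target="{}">\g<0></a>'.format(target)
    return URL_RE.sub(replacement, text)

print(linkify_text("An absolute URL: https://www.w3schools.com"))
# An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a>
```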
Upvotes: 1