Dharmraj

Reputation: 586

How do I convert URLs in plain text to clickable links using Python?

I have plain text; for example, consider the following sentence:

I was surfing www.google.com and I found an interesting site www.stackoverflow.com. It's amazing!

In the above example, www.google.com is plain text, which I need converted to www.google.com wrapped in an anchor tag linking to google.com. Meanwhile, www.stackoverflow.com is already inside an anchor tag, which I want to keep intact. How can I do this using Python regular expressions?

Upvotes: 3

Views: 3624

Answers (1)

M. Volf

Reputation: 1482

This task has to be split into two parts:

  • extract all text that is not already inside a tag
  • find (or rather, guess) all URLs in that text and wrap them

For the first part, I recommend going with BeautifulSoup. You could also use html.parser directly, but that would be a lot of extra work.

Use a recursive function for finding the text:

from bs4 import BeautifulSoup
from bs4.element import NavigableString

your_text = """I was surfing <a href="...">www.google.com</a>, and I found an
interesting site https://www.stackoverflow.com/. It's amazing! I also liked
Heroku (http://heroku.com/pricing)
more.domains.tld/at-the-end-of-line
https://at-the_end_of-text.com"""

soup = BeautifulSoup(your_text, "html.parser")

def wrap_plaintext_links(bs_tag):
    for element in bs_tag.children:
        if type(element) == NavigableString:
            pass # now we have a text node, process it
        # so it is a Tag (or the soup object, which is for most purposes a tag as well)
        elif element.name != "a": # if it isn't the a tag, process it recursively
            wrap_plaintext_links(element)

wrap_plaintext_links(soup) # call the recursive function

You can check that it finds only the text you want by replacing pass with print(element).
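To make that concrete, here is a small self-contained sketch of the same recursion that collects the visited text nodes into a list instead of processing them (the HTML snippet and the names collect_text_nodes and found are illustrative, not part of the answer's code):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

snippet = 'Visit <a href="http://x.com">x.com</a> or www.example.com <b>bold www.b.com</b>'
soup = BeautifulSoup(snippet, "html.parser")

found = []  # the text nodes the recursion would hand to the URL regex

def collect_text_nodes(bs_tag):
    for element in bs_tag.children:
        if isinstance(element, NavigableString):
            found.append(str(element))
        elif element.name != "a":  # descend into everything except <a>
            collect_text_nodes(element)

collect_text_nodes(soup)
print(found)
```

Note that "x.com" inside the existing anchor is never visited, while the text inside <b> is, because the recursion only stops at <a> tags.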


Now for finding the URLs and doing the replacement itself. The complexity of the regular expression depends on how precise you want to be. I would go with this:

(https?://)?        # match http(s):// in separate group if present
(                   # start of the main capturing group, what will be between the tags
  (?:[\w-]+\.)+     #   at least one domain and any subdomains before TLD
  [a-z]+            #   TLD
  (?:/\S*?)?        #   /[anything except whitespace] if present - URL path
)                   # end of the group
(?=[\.,)]?(?:\s|$)) # don't consume a trailing ".", "," or ")" that belongs to the surrounding text
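As a quick sanity check, the same pattern can be compiled with re.X so the comments above live inside the pattern itself (the sample sentence is made up for illustration):

```python
import re

# The pattern from above, compiled in verbose mode so the explanation survives in code
URL_RE = re.compile(r"""
    (https?://)?        # match http(s):// in a separate group if present
    (                   # main capturing group: what will be between the tags
      (?:[\w-]+\.)+     #   at least one domain and any subdomains before the TLD
      [a-z]+            #   TLD
      (?:/\S*?)?        #   /[anything except whitespace] if present - URL path
    )                   # end of the group
    (?=[\.,)]?(?:\s|$)) # don't consume a trailing ".", "," or ")"
""", re.X)

sample = "I was surfing www.google.com and liked Heroku (http://heroku.com/pricing) a lot"
print([(m.group(1), m.group(2)) for m in URL_RE.finditer(sample)])
# → [(None, 'www.google.com'), ('http://', 'heroku.com/pricing')]
```

Group 1 tells you whether a scheme was present, and group 2 is the display text; note the closing ")" after the Heroku URL is correctly left out of the match.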

The function and code additions including replacing:

import re

def create_replacement(matchobj):
    if matchobj.group(1): # if there's http(s)://, keep it
        full_url = matchobj.group(0)
    else: # otherwise prepend it. it would be a long discussion if https or http. decide.
        full_url = "http://" + matchobj.group(2)
    tag = soup.new_tag("a", href=full_url)
    tag.string = matchobj.group(2)
    return str(tag)

# compile the pattern beforehand, as it's going to be used many times
r = re.compile(r"(https?://)?((?:[\w-]+\.)+[a-z]+(?:/\S*?)?)(?=[\.,)]?(?:\s|$))")

def wrap_plaintext_links(bs_tag):
    for element in list(bs_tag.children): # copy the list, since we modify the tree while iterating
        if type(element) == NavigableString:
            replaced = r.sub(create_replacement, str(element))
            element.replace_with(BeautifulSoup(replaced, "html.parser")) # parse it as a Soup so that the tags aren't escaped
        elif element.name != "a":
            wrap_plaintext_links(element)

Note: you can also keep the pattern explanation in the code, as written above — see the re.X flag.
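Putting both halves together, a minimal end-to-end run might look like this (the input string here is made up for illustration; replace_with is the current spelling of the older replaceWith, and iterating over a copy of .children avoids mutating the tree mid-iteration):

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

r = re.compile(r"(https?://)?((?:[\w-]+\.)+[a-z]+(?:/\S*?)?)(?=[\.,)]?(?:\s|$))")

your_text = ('I was surfing <a href="https://google.com">www.google.com</a>'
             ' and www.stackoverflow.com. Amazing!')
soup = BeautifulSoup(your_text, "html.parser")

def create_replacement(matchobj):
    if matchobj.group(1):  # scheme present, keep the URL as-is
        full_url = matchobj.group(0)
    else:                  # no scheme, prepend one
        full_url = "http://" + matchobj.group(2)
    tag = soup.new_tag("a", href=full_url)
    tag.string = matchobj.group(2)
    return str(tag)

def wrap_plaintext_links(bs_tag):
    for element in list(bs_tag.children):  # copy: we replace nodes while iterating
        if isinstance(element, NavigableString):
            replaced = r.sub(create_replacement, str(element))
            element.replace_with(BeautifulSoup(replaced, "html.parser"))
        elif element.name != "a":          # leave existing anchors untouched
            wrap_plaintext_links(element)

wrap_plaintext_links(soup)
print(soup)
```

The existing google.com anchor comes through unchanged, while the bare www.stackoverflow.com gets wrapped with an http:// href prepended.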

Upvotes: 2
