Reputation: 586
I have plain text; for example, consider the following sentence:
I was surfing www.google.com and I found an interesting site <a href="...">www.stackoverflow.com</a>. It's amazing!
In the above example, www.google.com is plain text, which I need to convert to <a href="...">www.google.com</a> (wrapped in an anchor tag that links to google.com), while www.stackoverflow.com is already inside an anchor tag, which I want to keep intact. How can I do this using Python regular expressions?
Upvotes: 3
Views: 3624
Reputation: 1482
This task has to be split into two parts:
1. finding the text that is not already inside an a tag
2. finding the URLs in that text and wrapping them
For the first part, I recommend going with BeautifulSoup. You could also use html.parser, but that would be a lot of extra work.
Use a recursive function for finding the text:
from bs4 import BeautifulSoup
from bs4.element import NavigableString

your_text = """I was surfing <a href="...">www.google.com</a>, and I found an
interesting site https://www.stackoverflow.com/. It's amazing! I also liked
Heroku (http://heroku.com/pricing)
more.domains.tld/at-the-end-of-line
https://at-the_end_of-text.com"""

soup = BeautifulSoup(your_text, "html.parser")

def wrap_plaintext_links(bs_tag):
    for element in bs_tag.children:
        if type(element) == NavigableString:
            pass  # now we have a text node, process it
        # so it is a Tag (or the soup object, which is for most purposes a tag as well)
        elif element.name != "a":  # if it isn't the a tag, process it recursively
            wrap_plaintext_links(element)

wrap_plaintext_links(soup)  # call the recursive function
You can test that it finds only the values you want by replacing pass with print(element).
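For comparison, here is a rough sketch of what the first part could look like with the standard library's html.parser instead. The class name TextOutsideLinks and the sample string are my own assumptions, not part of the original code. It only collects the text that sits outside a tags; rebuilding the document around the replaced text would be the extra work mentioned above.

```python
from html.parser import HTMLParser


class TextOutsideLinks(HTMLParser):
    """Collect text nodes that are not inside an <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.anchor_depth = 0  # how many <a> elements we are currently inside
        self.chunks = []       # text found outside of any <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        if self.anchor_depth == 0:
            self.chunks.append(data)


sample = ('I was surfing <a href="...">www.google.com</a>, '
          'and I found www.stackoverflow.com. <b>Nice</b>!')
parser = TextOutsideLinks()
parser.feed(sample)
parser.close()  # flush any text that is still buffered
outside_text = "".join(parser.chunks)
print(outside_text)
```

The linked www.google.com is skipped, while www.stackoverflow.com and the text inside the b tag are kept.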
Now for finding the URLs and the replacement itself. The complexity of the regular expression really depends on how precise you want to be. I would go with this:
(https?://)?          # match http(s):// in a separate group if present
(                     # start of the main capturing group, what will be between the tags
  (?:[\w-]+\.)+       # at least one domain and any subdomains before the TLD
  [a-z]+              # TLD
  (?:/\S*?)?          # /[anything except whitespace] if present - URL path
)                     # end of the group
(?=[\.,)]?(?:\s|$))   # prevent matching any of ".,)" that might appear immediately after the URL as the text goes...
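To see what the pattern actually picks up, it can be compiled in verbose mode and run over a few of the tricky cases. This is only a quick sketch: the names URL_PATTERN, sample and matches are mine, and the sample is a shortened version of the test text.

```python
import re

# Same pattern as above, written with the re.X (verbose) flag so that
# whitespace and "#" comments inside the pattern are ignored.
URL_PATTERN = re.compile(
    r"""
    (https?://)?          # group 1: optional scheme
    (                     # group 2: the part shown between the tags
      (?:[\w-]+\.)+       # one or more dot-terminated labels
      [a-z]+              # TLD
      (?:/\S*?)?          # optional path, matched lazily
    )
    (?=[\.,)]?(?:\s|$))   # must be followed by optional punctuation, then whitespace or end
    """,
    re.X,
)

sample = (
    "I found https://www.stackoverflow.com/. I also liked "
    "Heroku (http://heroku.com/pricing)\n"
    "more.domains.tld/at-the-end-of-line\n"
    "https://at-the_end_of-text.com"
)

matches = [m.group(0) for m in URL_PATTERN.finditer(sample)]
for url in matches:
    print(url)
```

The trailing "." after the first URL and the ")" after the Heroku one are left out of the matches, which is what the lookahead is for.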
The replacement function and the code changes, including the replacement itself:
import re

def create_replacement(matchobj):
    if matchobj.group(1):  # if there's http(s)://, keep it
        full_url = matchobj.group(0)
    else:  # otherwise prepend it; whether to use https or http is your call
        full_url = "http://" + matchobj.group(2)
    tag = soup.new_tag("a", href=full_url)
    tag.string = matchobj.group(2)
    return str(tag)

# compile the pattern beforehand, as it's going to be used many times
r = re.compile(r"(https?://)?((?:[\w-]+\.)+[a-z]+(?:/\S*?)?)(?=[\.,)]?(?:\s|$))")

def wrap_plaintext_links(bs_tag):
    # iterate over a copy, because replacing nodes changes the list of children
    for element in list(bs_tag.children):
        if type(element) == NavigableString:
            replaced = r.sub(create_replacement, str(element))
            # make it a Soup so that the tags aren't escaped;
            # naming the parser avoids a warning and matches the one used for soup
            element.replace_with(BeautifulSoup(replaced, "html.parser"))
        elif element.name != "a":
            wrap_plaintext_links(element)
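If you want to check the substitution on its own, without any soup around, something like the following should do. Note that to_anchor is a hypothetical stand-in for create_replacement: it builds the tag as an escaped string instead of calling soup.new_tag, and the sample text is made up.

```python
import re
from html import escape

URL_RE = re.compile(r"(https?://)?((?:[\w-]+\.)+[a-z]+(?:/\S*?)?)(?=[\.,)]?(?:\s|$))")


def to_anchor(match):
    """Build the <a> tag as a plain string (assumption: no soup object available)."""
    visible = match.group(2)
    # keep the scheme if there is one, otherwise prepend http://
    href = match.group(0) if match.group(1) else "http://" + visible
    return '<a href="{}">{}</a>'.format(escape(href, quote=True), escape(visible))


text = "Heroku (http://heroku.com/pricing) and www.google.com."
result = URL_RE.sub(to_anchor, text)
print(result)
```

This prints the sentence with both URLs wrapped, while the closing parenthesis and the final full stop stay outside the tags.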
Note: you can also include the pattern explanation in the code as I wrote it above; see the re.X flag.
Upvotes: 2