sthr

Reputation: 33

Python regex to replace URLs in text with links (conversion from PHP)

Could someone convert this PHP regex to Python? I have tried several times without success:

function convertLinks($text) {
    return preg_replace("/(?:(http:\/\/)|(www\.))(\S+\b\/?)([[:punct:]]*)(\s|$)/i",
    "<a href=\"http://$2$3\" rel=\"nofollow\">$1$2$3</a>$4$5", $text);
}

Edit: I found that [:punct:] can be replaced by [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~], so I tried this:

def convertLinks(text):
    pat = re.compile(ur"""(?:(http://)|(www\.))(\S+\b\/?)([!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]*)(\s|$)""", re.IGNORECASE)
    return pat.sub(ur'<a href=\"http://\2\3" rel=\"nofollow\">\1\2\3</a>\4\5', text)

but I received an "unmatched group" error for convertLinks(u"Test www.example.com test").

Upvotes: 1

Views: 2596

Answers (2)

Martijn Pieters

Reputation: 1121366

The expression uses some features that work differently in Python.

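  • Python's re module doesn't support the POSIX [[:punct:]] character class, so I expanded it by hand using a POSIX regex reference.

  • The expression puts http:// and www. in separate capturing groups joined by alternation, so only one of them takes part in any given match, yet the replacement refers to both. In Python 2 (and in Python 3 before 3.5), referring to a group that did not participate raises the "unmatched group" error. Solution: use a replacement function.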
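
To see why the replacement string trips up, it can help to look at what the match object holds when only one alternative takes part. This is a minimal sketch using a simplified stand-in pattern (not the full expression from the question):

```python
import re

# Simplified stand-in for the question's pattern: only one of the first
# two groups can take part in any given match.
pattern = re.compile(r'(?:(http://)|(www\.))(\S+)', re.I)

match = pattern.search('Test www.example.com test')
groups = match.groups()

# The http:// alternative did not participate, so its group is None.
print(groups)  # (None, 'www.', 'example.com')
```

A replacement function can turn that None into an empty string before building the output.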

So to get the same functionality, you can use:

import re

_link = re.compile(r'(?:(http://)|(www\.))(\S+\b/?)([!"#$%&\'()*+,\-./:;<=>?@[\\\]^_`{|}~]*)(\s|$)', re.I)

def convertLinks(text):
    def replace(match):
        groups = match.groups()
        # Only one of the first two groups takes part in a match; the other is None.
        protocol = groups[0] or ''
        www_lead = groups[1] or ''
        # groups[2:] holds the rest of the URL, trailing punctuation and whitespace.
        return '<a href="http://{1}{2}" rel="nofollow">{0}{1}{2}</a>{3}{4}'.format(
            protocol, www_lead, *groups[2:])
    return _link.sub(replace, text)

Demo:

>>> test = 'Some text with www.stackoverflow.com links in them like http://this.too/with/path?'
>>> convertLinks(test)
'Some text with <a href="http://www.stackoverflow.com" rel="nofollow">www.stackoverflow.com</a> links in them like <a href="http://this.too/with/path" rel="nofollow">http://this.too/with/path</a>?'

Upvotes: 2

TerryA

Reputation: 59974

To use regular expressions in Python, use the re module. For this task, the relevant function is re.sub.

A call has this general form:

output = re.sub(regular_expression, what_it_should_be_replaced_by, input)

Don't forget that re.sub() returns the substituted string.
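
As a small illustration of the call above (the sample text and patterns here are made up for the example, not taken from the question), the replacement can be either a string with backreferences or a function:

```python
import re

text = 'Visit www.example.com today'

# A replacement string can use backreferences such as \1.
linked = re.sub(r'(www\.\S+)', r'<a href="http://\1">\1</a>', text)

# A replacement function receives the match object and returns a string.
shouted = re.sub(r'www\.\S+', lambda m: m.group(0).upper(), text)

print(linked)
print(shouted)
print(text)  # the input string itself is left unchanged
```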

Upvotes: 0
