jkp

Reputation: 81278

What's the cleanest way to extract URLs from a string using Python?

Although I know I could use some huge-ass regex, such as the one posted here, I'm wondering if there is some tweaky-as-hell way to do this, either with a standard module or perhaps some third-party add-on?

Simple question, but nothing jumped out on Google (or Stackoverflow).

Look forward to seeing how y'all do this!

Upvotes: 26

Views: 53911

Answers (9)

dranxo

Reputation: 3388

I know that it's exactly what you do not want, but here's a file with a huge regex:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
the web url matching regex used by markdown
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611
"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

I call that file urlmarker.py, and when I need it I just import it, e.g.:

import urlmarker
import re
re.findall(urlmarker.URL_REGEX, 'some text news.yahoo.com more text')  # ['news.yahoo.com']

cf. http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Also, here is what Django (1.6) uses to validate URLFields:

regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

cf. https://github.com/django/django/blob/1.6/django/core/validators.py#L43-50

Django 1.9 has that logic split across a few classes.
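
Note that this pattern is anchored with ^ and $, so it validates whole strings rather than extracting from running text; to use it for extraction, you would apply it to candidate tokens. A minimal sketch, assuming regex is the compiled pattern above:

for candidate in ('http://example.com/page', 'ftp://10.0.0.1/file', 'not a url'):
    print(candidate, '->', bool(regex.match(candidate)))
# http://example.com/page -> True
# ftp://10.0.0.1/file -> True
# not a url -> False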

Upvotes: 21

Jan Lipovský

Reputation: 371

There is another easy way to extract URLs from text. You can use urlextract to do it for you; just install it via pip:

pip install urlextract

and then you can use it like this:

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']

You can find more info on my github page: https://github.com/lipoja/URLExtract

NOTE: It downloads a list of TLDs from iana.org to keep itself up to date. But if the program does not have internet access, then it's not for you.

This approach is similar to urlextractor (mentioned in another answer here), but my code is recent, maintained, and I am open to any suggestions (new features).

Upvotes: 1

zlin

Reputation: 195

import re

text = '<p>Please click <a href="http://www.dr-chuck.com">here</a></p>'
# [^"]+ keeps each match inside a single attribute; a greedy .+ would
# span across multiple href="..." attributes in the same string.
urls = re.findall(r'href="([^"]+)"', text)
print(urls)  # ['http://www.dr-chuck.com']

Upvotes: 0

Shatnerz

Reputation: 2543

I'm late to the party, but here is a solution someone from #python on freenode suggested to me. It avoids the regex hassle.

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def extract_urls(text):
    """Return a list of urls from a text string."""
    out = []
    for word in text.split(' '):
        thing = urlparse(word.strip())
        if thing.scheme:
            out.append(word)
    return out
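
For example (note that urlparse only checks syntax, so any token with a scheme-like prefix such as 'mailto:' — or even 'note:this' — will also be picked up):

>>> extract_urls("see https://example.com and ftp://host/file for details")
['https://example.com', 'ftp://host/file']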

Upvotes: 3

tohster

Reputation: 7093

There is an excellent comparison of 13 different regex approaches, which can be found at this page: In search of the perfect URL validation regex.

The Diego Perini regex, which passed all the tests, is very long but is available at his gist here.
Note that you will have to convert his PHP version to a Python regex (there are slight differences).

I ended up using the Imme Emosol version, which passes the vast majority of tests and is a fraction of the size of Diego Perini's.

Here is a Python-compatible version of the Imme Emosol regex:

r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'
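
Since it is anchored with ^ and $, it is a validation regex; a minimal sketch of using it for extraction by testing whitespace-separated tokens (the helper name is mine):

import re

# The Imme Emosol pattern quoted above:
URL_PATTERN = r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'
validator = re.compile(URL_PATTERN, re.IGNORECASE)

def extract_urls(text):
    """Return whitespace-separated tokens that validate as URLs."""
    return [token for token in text.split() if validator.match(token)]

print(extract_urls("docs at https://example.com/path and ftp://files.example.org"))
# ['https://example.com/path', 'ftp://files.example.org']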

Upvotes: 14

Gaslight Deceive Subvert

Reputation: 20382

Look at Django's approach here: django.utils.html.urlize(). Regexes are too limited for the job, and you have to use heuristics to get results that are mostly right.
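
For illustration (assuming a recent Django, where the function lives in django.utils.html; it wraps anything it recognizes as a link in an <a> tag):

from django.utils.html import urlize

text = "Check www.djangoproject.com and http://example.com/docs for details."
print(urlize(text))
# Something like:
# Check <a href="http://www.djangoproject.com">www.djangoproject.com</a> and
# <a href="http://example.com/docs">http://example.com/docs</a> for details.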

Upvotes: 14

Imran

Reputation: 591

You can use this library I wrote:

https://github.com/imranghory/urlextractor

It's extremely hacky, but it doesn't rely upon "http://" like many other techniques; rather, it uses the Mozilla TLD list (via the tldextract library) to search for TLDs (e.g. ".co.uk", ".com") in the text and then attempts to construct URLs around the TLD.

It doesn't aim to be RFC-compliant, but rather to be accurate for how URLs are used in practice in the real world. For example, it will reject the technically valid domain "com" (you can actually use a TLD as a domain, although it's rare in practice) and will strip trailing full stops or commas from URLs.
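
A minimal sketch of the same TLD-driven idea using tldextract directly (this is not Imran's library, and the helper is mine; tldextract downloads and caches the public-suffix list on first use):

import tldextract  # pip install tldextract

def extract_urls(text):
    """Collect tokens whose domain part carries a known public suffix."""
    urls = []
    for token in text.split():
        candidate = token.strip('.,;:!?()<>"\'')   # drop trailing punctuation
        if tldextract.extract(candidate).suffix:   # empty suffix = no known TLD
            urls.append(candidate)
    return urls

print(extract_urls("Visit news.bbc.co.uk, or see example.com."))
# ['news.bbc.co.uk', 'example.com']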

Upvotes: 7

Seb

Reputation: 17805

You can use BeautifulSoup.

from bs4 import BeautifulSoup

def extractlinks(html):
    soup = BeautifulSoup(html, 'html.parser')
    anchors = soup.find_all('a', href=True)  # skip anchors without an href
    links = []
    for a in anchors:
        links.append(a['href'])
    return links

Note that the solution with regexes is faster, although it will not be as accurate.
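
For example, with the function above:

html = '<p>See <a href="http://example.com">example</a> and <a href="/local">local</a>.</p>'
print(extractlinks(html))
# ['http://example.com', '/local']

Keep in mind that href values can be relative (like '/local' here), so you may still need urllib.parse.urljoin with a base URL to get absolute URLs.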

Upvotes: 4

sinzi

Reputation: 141

If you know that the string contains a URL followed by a space, you can do something like this, where s is the string containing the URL:

>>> t = s[s.find("http://"):]
>>> t = t[:t.find(" ")]

Otherwise, you need to check whether find returns -1 or not.
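
A version with those checks spelled out (a sketch; like the original, it only finds the first http:// URL):

def first_url(s):
    """Return the first http:// URL in s, or None if there is none."""
    start = s.find("http://")
    if start == -1:
        return None
    end = s.find(" ", start)
    return s[start:] if end == -1 else s[start:end]

print(first_url("text http://example.com/page more text"))  # http://example.com/page
print(first_url("no url here"))                             # None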

Upvotes: 6
