Reputation: 81278
Although I know I could use some huge-ass regex such as the one posted here, I'm wondering if there is some tweaky-as-hell way to do this, either with a standard module or perhaps some third-party add-on?
Simple question, but nothing jumped out on Google (or Stackoverflow).
Look forward to seeing how y'all do this!
Upvotes: 26
Views: 53911
Reputation: 3388
I know that it's exactly what you do not want, but here's a file with a huge regex:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
the web url matching regex used by markdown
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611
"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
I call that file urlmarker.py, and when I need it I just import it, e.g.:
import urlmarker
import re
re.findall(urlmarker.URL_REGEX,'some text news.yahoo.com more text')
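# expected: ['news.yahoo.com']  (with a single outer capturing group,
# findall returns just that group for each match)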
cf. http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Also, here is what Django (1.6) uses to validate URLFields:
regex = re.compile(
    r'^(?:http|ftp)s?://' # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' # domain...
    r'localhost|' # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' # ...or ipv6
    r'(?::\d+)?' # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)
cf. https://github.com/django/django/blob/1.6/django/core/validators.py#L43-50
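Since that pattern is anchored with ^ and $, it validates whole strings rather than extracting URLs out of longer text. A minimal sketch of using it, continuing from the snippet above (is_valid_url is my own helper name, not Django's API):
# assumes `regex` was compiled as above, with `import re` already done
def is_valid_url(url):
    return regex.match(url) is not None

print(is_valid_url('http://www.example.com'))  # True
print(is_valid_url('www.example.com'))         # False: the scheme is required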
Django 1.9 has that logic split across a few classes.
Upvotes: 21
Reputation: 371
There is another way to extract URLs from text easily. You can use urlextract to do it for you; just install it via pip:
pip install urlextract
and then you can use it like this:
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']
You can find more info on my github page: https://github.com/lipoja/URLExtract
NOTE: It downloads a list of TLDs from iana.org to keep itself up to date. But if the program does not have internet access, then it's not for you.
This approach is similar to the one in urlextractor (mentioned above), but my code is recent and maintained, and I am open to any suggestions (new features).
Upvotes: 1
Reputation: 195
import re

text = '<p>Please click <a href="http://www.dr-chuck.com">here</a></p>'
# [^"]+ keeps each match inside one attribute; a greedy .+ would
# over-match across several href="..." attributes on the same line
aa = re.findall('href="([^"]+)"', text)
print(aa)  # ['http://www.dr-chuck.com']
Upvotes: 0
Reputation: 2543
I'm late to the party, but here is a solution someone from #python on freenode suggested to me. It avoids the regex hassle.
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def extract_urls(text):
    """Return a list of urls from a text string."""
    out = []
    for word in text.split(' '):
        thing = urlparse(word.strip())
        if thing.scheme:
            out.append(word)
    return out
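Note that it only finds URLs carrying an explicit scheme (http://, https://, ftp://, ...), since it keys on thing.scheme. A quick check of the expected behaviour:
print(extract_urls("try http://example.com or https://foo.bar/baz today"))
# expected: ['http://example.com', 'https://foo.bar/baz']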
Upvotes: 3
Reputation: 7093
There is a thorough comparison of candidate regexes, which can be found at this page: In search of the perfect URL validation regex.
The Diego Perini regex, which passed all the tests, is very long but is available at his gist here.
Note that you will have to convert his PHP version to a Python regex (there are slight differences).
I ended up using the Imme Emosol version, which passes the vast majority of tests and is a fraction of the size of Diego Perini's.
Here is a Python-compatible version of the Imme Emosol regex:
r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'
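A minimal validation sketch (IMME_EMOSOL is my own name for the compiled pattern; note the ^/$ anchors make it a whole-string validator, not an extractor):
import re

# the pattern quoted above, compiled; as written it matches only
# lowercase schemes and hostnames
IMME_EMOSOL = re.compile(
    r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'
)

print(bool(IMME_EMOSOL.match('https://example.com/path')))  # True
print(bool(IMME_EMOSOL.match('example.com')))               # False: a scheme is required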
Upvotes: 14
Reputation: 20382
Look at Django's approach here: django.utils.urlize(). Regexps are too limited for the job, and you have to use heuristics to get results that are mostly right.
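For reference, a minimal sketch of calling it (requires Django installed; in current versions the function lives in django.utils.html, and the output shown is approximate):
from django.utils.html import urlize

print(urlize('Visit www.example.com today'))
# roughly: Visit <a href="http://www.example.com">www.example.com</a> today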
Upvotes: 14
Reputation: 591
You can use this library I wrote:
https://github.com/imranghory/urlextractor
It's extremely hacky, but it doesn't rely upon "http://" like many other techniques; rather, it uses the Mozilla TLD list (via the tldextract library) to search for TLDs (e.g. ".co.uk", ".com", etc.) in the text and then attempts to construct URLs around the TLD.
It doesn't aim to be RFC-compliant but rather to match how URLs are actually used in practice in the real world. For example, it will reject the technically valid domain "com" (you can actually use a TLD as a domain, although it's rare in practice) and will strip trailing full-stops or commas from URLs.
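A minimal sketch of the same idea, assuming tldextract is installed (my own illustration, not the library's actual code):
import tldextract

def find_url_like(text):
    """Collect tokens that look like domains: a name plus a known TLD."""
    candidates = []
    for token in text.split():
        token = token.strip('.,;:!?')      # strip trailing punctuation
        ext = tldextract.extract(token)
        if ext.domain and ext.suffix:      # a name part AND a known TLD
            candidates.append(token)
    return candidates

print(find_url_like("see news.yahoo.com, or just com alone"))
# expected: ['news.yahoo.com']  (bare 'com' has no domain part, so it is rejected)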
Upvotes: 7
Reputation: 17805
You can use BeautifulSoup.
from bs4 import BeautifulSoup

def extractlinks(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips anchors that have no href attribute
    for a in soup.find_all('a', href=True):
        links.append(a['href'])
    return links
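A quick check of the expected behaviour:
print(extractlinks('<p>Please click <a href="http://www.dr-chuck.com">here</a></p>'))
# expected: ['http://www.dr-chuck.com']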
Note that the solution with regexes is faster, although it will not be as accurate.
Upvotes: 4
Reputation: 141
If you know that there is a URL following a space in the string, you can do something like this (s is the string containing the URL):
>>> t = s[s.find("http://"):]
>>> t = t[:t.find(" ")]
Otherwise you need to check whether find returns -1 or not.
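A version with that check folded in (my own wrapping of the same idea):
def first_url(s):
    """Return the first http:// URL in s, or None if there is none."""
    start = s.find("http://")
    if start == -1:
        return None
    end = s.find(" ", start)
    return s[start:] if end == -1 else s[start:end]

print(first_url("click http://example.com now"))  # http://example.com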
Upvotes: 6