TimLeung

Reputation: 3479

Find Hyperlinks in Text using Python (twitter related)

How can I parse text and find all hyperlinks within a string? The hyperlinks will not be in HTML form, such as <a href="http://test.com">test</a>, but plain URLs such as http://test.com.

Secondly, I would then like to convert the original string, replacing every hyperlink with a clickable HTML hyperlink.

I found an example in this thread:

Easiest way to convert a URL to a hyperlink in a C# string?

but was unable to reproduce it in python :(

Upvotes: 15

Views: 25030

Answers (5)

Jan Lipovský

Reputation: 371

Have a look at urlextract.

You can install it running: pip install urlextract

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']

The main advantage is that urlextract finds URLs even when they lack a scheme (http, ftp, etc.). It also has many configuration options for tuning the extractor to your needs; see the documentation for details.

Upvotes: 2

dfrankow

Reputation: 21469

Here is a much more sophisticated regexp from 2002.

@yoniLavi minified this to:

re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')
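For illustration, here is a minimal sketch of how that pattern could be used for both steps in the question: collecting the URLs and then wrapping them in anchor tags. The name URL_RE and the sample text are my own assumptions; the pattern itself is copied unchanged from above. Note that the lookahead leaves trailing punctuation such as a final period outside the match.

```python
import re

# Pattern copied verbatim from the answer; URL_RE is a hypothetical name.
URL_RE = re.compile(
    r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?'
    r'(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))'
)

# Made-up sample text with trailing punctuation after each URL.
text = "See http://test.com/page?x=1. Also ftp://files.example.org/readme, please."

# The pattern has no capturing groups, so findall returns the full matches.
urls = URL_RE.findall(text)

# Wrap every match in an anchor tag, leaving the surrounding text untouched.
linked = URL_RE.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)

print(urls)
print(linked)
```

The replacement does not HTML-escape the URL, so a URL containing & would need escaping before being placed in a real page.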

Upvotes: 10

Kekoa

Reputation: 28250

Django also has a solution that doesn't rely solely on regex: django.utils.html.urlize(). I found it very helpful, especially if you are already using Django.

You can also extract the code to use in your own project.

Upvotes: 5

jmoz

Reputation: 8006

Jinja2 (Flask uses this) has a filter urlize which does the same.

Docs

Upvotes: 2

maxyfc

Reputation: 11337

Here's a Python port of Easiest way to convert a URL to a hyperlink in a C# string?:

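import re

myString = "This is my tweet check it out http://tinyurl.com/blah"

# Capture "http://" followed by everything up to the next space.
r = re.compile(r"(http://[^ ]+)")
# \1 refers to the captured URL; print() works on Python 2.7 and 3.
print(r.sub(r'<a href="\1">\1</a>', myString))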

Output:

This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
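If the tweets may contain https links or characters such as & and <, a slightly more defensive variant could look like the sketch below. This is not part of the C# original: the helper name linkify and the choice to HTML-escape everything with the standard library's html.escape are my own assumptions, and it requires Python 3. Because the pattern matches up to the next whitespace, punctuation directly after a URL is still swallowed into the link.

```python
import html
import re

# Matches http:// or https:// up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")


def linkify(text):
    """Escape text for HTML and wrap each http(s) URL in an anchor tag.

    linkify is a hypothetical helper name, not taken from any library.
    """
    parts = []
    last = 0
    for m in URL_RE.finditer(text):
        url = html.escape(m.group(0))
        # Escape the plain text between the previous URL and this one.
        parts.append(html.escape(text[last:m.start()]))
        parts.append('<a href="{0}">{0}</a>'.format(url))
        last = m.end()
    # Escape whatever follows the final URL.
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(linkify("Tom & Jerry: https://tinyurl.com/blah?a=1&b=2"))
```

Escaping the non-URL text separately keeps the generated anchor tags intact while neutralizing any markup the tweet itself contains.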

Upvotes: 23
