user2305415

Reputation: 172

In Python, how do I write a regex that captures the URL in an <a href tag?

I am trying to write a regex in Python that captures the URL in an <a href tag.

For example, given this:

<a href="http://www.simplyrecipes.com/recipes/broccoli_slaw_with_cranbery_orange_dressing/" n    title="Permalink to Broccoli Slaw with Cranberry Orange Dressing" rel="bookmark"><img    width="520" height="347" 

I need this part to be captured:

<a href="http://www.simplyrecipes.com/recipes/broccoli_slaw_with_cranbery_orange_dressing/" 

This is what I have so far:

^<a href="http://www(???what to put in here????)"$

But I don't know how to express the part of the URL after www, which must be included in the match but needs no special treatment.

Thanks in advance for any enlightenment!

Upvotes: 0

Views: 120

Answers (3)

Perefexexos

Reputation: 252

Use the re module:

import re

# url holds the text to search; the raw string keeps the backslashes intact
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url)
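As a quick check, here is a minimal sketch that runs this pattern on the tag from the question. The variable names html and pattern are chosen for this illustration only, and the tag is shortened after the href attribute.

```python
import re

# Tag from the question, shortened after the href attribute.
html = ('<a href="http://www.simplyrecipes.com/recipes/'
        'broccoli_slaw_with_cranbery_orange_dressing/" rel="bookmark">')

# None of the character classes contains a double quote,
# so the match stops at the quote that closes the href value.
pattern = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]'
           r'|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

urls = re.findall(pattern, html)
print(urls)
```

Because the pattern has no capturing groups, re.findall returns the full matched strings. Note that it picks up any http or https URL in the text, not only those inside an href attribute.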

Upvotes: 1

Elisha

Reputation: 4951

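Any character that is not a double quote is matched by: [^"]

So after the opening quote you can put [^"]*" to match everything up to and including the closing quote.

The full pattern is then: '<a href="[^"]*"'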
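A minimal sketch of that pattern applied to the tag from the question might look like this. The names html, match and matched_text are only for illustration, and the tag is shortened after the href attribute.

```python
import re

# Tag from the question, shortened after the href attribute.
html = ('<a href="http://www.simplyrecipes.com/recipes/'
        'broccoli_slaw_with_cranbery_orange_dressing/" rel="bookmark">')

# [^"]* consumes everything up to the quote that closes the href value.
match = re.search(r'<a href="[^"]*"', html)

# re.search returns None when nothing matches, so guard before using it.
matched_text = match.group(0) if match else None
print(matched_text)
```

If only the URL is wanted, wrapping [^"]* in parentheses and reading match.group(1) would return it without the surrounding tag text.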

Upvotes: 2

alexis

Reputation: 50210

You'll soon discover that not all URLs start with www, and many don't even start with http://. Here's how you would extract the URL in the href attribute of a link: match everything within the quotes that follow the <a href=. Spaces are legal in various places inside an HTML tag, which complicates things a little:

import re

matchobj = re.search(r'<\s*a\s+href\s*=\s*"([^"]*)', text, re.IGNORECASE)
url = matchobj.group(1)  # matchobj is None if text contains no such link

This will also get you relative URLs and other protocols besides http. If you're not interested in everything, it is easier to sort through the results after you've extracted them.
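To collect every link rather than only the first, the same pattern can be passed to re.findall and the results filtered afterwards. This is a rough sketch: the sample text and the names href_pattern, all_urls and http_urls are made up for the illustration.

```python
import re

# Hypothetical input, invented for this example.
text = '''
<a href="http://www.example.com/page">absolute</a>
<A HREF = "/recipes/soup/">relative</A>
<a href="mailto:someone@example.com">mail</a>
'''

# Same pattern as above, compiled once; the group captures the quoted value.
href_pattern = re.compile(r'<\s*a\s+href\s*=\s*"([^"]*)', re.IGNORECASE)

# With one capturing group, findall returns a list of the captured URLs.
all_urls = href_pattern.findall(text)

# Sort through the results afterwards, here keeping only http(s) links.
http_urls = [u for u in all_urls
             if u.lower().startswith(('http://', 'https://'))]

print(all_urls)
print(http_urls)
```

One limitation to keep in mind: the pattern expects href to be the first attribute after <a, so a tag such as <a class="x" href="..."> would not match.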

Upvotes: 1
