Reputation: 1591

Extracting URLs in Python from XML

I read this thread about extracting URLs from a string (https://stackoverflow.com/a/840014/326905). Really nice. I got all the URLs from an XML document containing http://www.blabla.com with

>>> import re
>>> s = '''<link href="http://www.blabla.com/blah" />
...          <link href="http://www.blabla.com" />'''
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']

But I can't figure out how to customize the regex to omit the double quote at the end of each URL.

At first I thought this was the answer

re.findall(r'(https?://\S+\")', s)

or this

re.findall(r'(https?://\S+\Z")', s)

but it isn't.

Can somebody help me out and tell me how to omit the double quote at the end?

By the way, the question mark after the "s" in https means the "s" is optional, i.e. it may or may not occur. Am I right?

Upvotes: 0

Answers (5)

Drover

Reputation: 116

>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
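If lxml is not installed, the same parser-based idea can be sketched with only the standard library. This is a minimal illustration, not part of the original answer: the class name HrefCollector is made up for this example, and it assumes the two-line sample string from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects every href attribute it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <link ... /> also end up here via the default handle_startendtag.
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


# Sample input taken from the question.
s = '''<link href="http://www.blabla.com/blah" />
         <link href="http://www.blabla.com" />'''

collector = HrefCollector()
collector.feed(s)
collector.close()
print(collector.hrefs)
```

Because the parser hands over attribute values with the surrounding quotes already removed, the trailing double quote never becomes part of the result.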

Upvotes: 2

Kenneth K.

Reputation: 3039

You're already using a character class (albeit a shorthand version). I'd suggest modifying the character class a bit so that you don't need a lookahead. Simply add the quote as part of the character class:

re.findall(r'(https?://[^\s"]+)', s)

This still says "one or more characters that are not whitespace," but additionally excludes double quotes. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
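A quick side-by-side check of the two patterns, as a small sketch. One assumption for illustration: the second link in the sample string is changed to https here, so the optional "s" in https? is exercised as well.

```python
import re

# Sample from the question; the second URL uses https (changed for this example).
s = '''<link href="http://www.blabla.com/blah" />
         <link href="https://www.blabla.com" />'''

# \S+ stops only at whitespace, so the closing quote is swallowed.
with_quote = re.findall(r'(https?://\S+)', s)

# [^\s"]+ stops at whitespace or a double quote, whichever comes first.
without_quote = re.findall(r'(https?://[^\s"]+)', s)

print(with_quote)
print(without_quote)
```

Both the http and the https URL are matched by either pattern, which confirms that the ? makes the preceding "s" optional.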

Upvotes: 1

Thanasis Petsas

Reputation: 4448

I used to extract URLs from text through this piece of code:

import re

url_rgx = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# text holds the input string to search
# convert it to lower case (note: this also lowercases the URL paths)
text = text.lower()
# findall returns one tuple of groups per match; group 0 is the full URL
matches = url_rgx.findall(text)
# prepend 'http://' when the scheme is missing
urls = ['http://%s' % url[0] if not url[0].startswith('http') else url[0] for url in matches]
print(urls)

It works great!

Upvotes: 1

surfi

Reputation: 1591

Thanks. I just read this https://stackoverflow.com/a/13057368/326905

and checked out this which is also working.