surfi

Reputation: 1591

Extracting URLs in Python from XML

I read this answer about extracting URLs from a string: https://stackoverflow.com/a/840014/326905 It works nicely. I got all the URLs from an XML document containing http://www.blabla.com with

>>> import re
>>> s = '''<link href="http://www.blabla.com/blah" />
...        <link href="http://www.blabla.com" />'''
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']

But I can't figure out how to customize the regex to omit the double quote at the end of each URL.

First I thought that this is the solution

re.findall(r'(https?://\S+\")', s)

or this

re.findall(r'(https?://\S+\Z")', s)

but it isn't.

Can somebody help me out and tell me how to omit the double quote at the end?

By the way, the question mark after the "s" of https means the "s" may or may not occur. Am I right?

Upvotes: 0

Views: 2796

Answers (5)

Drover

Reputation: 116

>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')  # the sample uses <link> elements, not <a>
['http://www.blabla.com/blah', 'http://www.blabla.com']
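
If lxml is not available, the same parser-based idea can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch under assumptions: the `<root>` wrapper is hypothetical (added because the sample has two top-level elements), and the document is assumed to use no XML namespaces.

```python
import xml.etree.ElementTree as ET

# Sample input mirroring the question's two <link> elements.
s = '''<link href="http://www.blabla.com/blah" />
       <link href="http://www.blabla.com" />'''

# Wrap the fragment in a hypothetical <root> element so it is well-formed XML.
root = ET.fromstring('<root>' + s + '</root>')

# Collect the href attribute of every <link> element.
hrefs = [link.get('href') for link in root.iter('link')]
print(hrefs)  # ['http://www.blabla.com/blah', 'http://www.blabla.com']
```

With a namespaced document (for example an Atom feed), the tag passed to iter() would need the namespace prefix in `{uri}link` form.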

Upvotes: 2

Kenneth K.

Reputation: 3039

You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:

re.findall(r'(https?://[^\s"]+)', s)

This still says "one or more characters that are not whitespace," but additionally excludes double quotes. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
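
A minimal sketch of the negated character class, run against a sample string modelled on the one in the question (the variable names are only for illustration):

```python
import re

# Sample input mirroring the question's two <link> elements.
s = '''<link href="http://www.blabla.com/blah" />
       <link href="http://www.blabla.com" />'''

# [^\s"]+ consumes characters until it reaches whitespace or a double quote,
# so the attribute's closing quote never becomes part of the match.
urls = re.findall(r'https?://[^\s"]+', s)
print(urls)  # ['http://www.blabla.com/blah', 'http://www.blabla.com']
```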

Upvotes: 1

Thanasis Petsas

Reputation: 4448

I used to extract URLs from text through this piece of code:

import re

url_rgx = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# `text` is the input string to search
# convert string to lower case (note: this also lowercases URL paths)
text = text.lower()
# findall returns one tuple of groups per match; group 0 is the whole URL
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missing
urls = ['http://%s' % url[0] if not url[0].startswith('http') else url[0] for url in matches]
print(urls)

It works great!

Upvotes: 1

surfi

Reputation: 1591

Thanks. I just read this: https://stackoverflow.com/a/13057368/326905

and tried the following, which also works:

re.findall(r'"(https?://\S+)"', urls)

Upvotes: 0

Daedalus

Reputation: 1667

You want the double quotes to appear as a look-ahead:

re.findall(r'(https?://\S+)(?=\")', s)

This way they won't appear as part of the match. Also, yes, the ? means the preceding character is optional.

See example here: http://regexr.com?347nk
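
A small sketch comparing the pattern with and without the lookahead. The sample string is an assumption modelled on the question, with one https link added to show the optional "s":

```python
import re

# One http and one https link, so the optional "s" is exercised as well.
s = '<link href="http://www.blabla.com/blah" /> <link href="https://www.blabla.com" />'

# (?=") asserts that a double quote follows, without consuming it.
with_lookahead = re.findall(r'https?://\S+(?=")', s)

# Without the lookahead, the greedy \S+ swallows the closing quote.
without_lookahead = re.findall(r'https?://\S+', s)

print(with_lookahead)     # ['http://www.blabla.com/blah', 'https://www.blabla.com']
print(without_lookahead)  # ['http://www.blabla.com/blah"', 'https://www.blabla.com"']
```

Because \S+ is greedy, the match ends at the last double quote before the next whitespace, so this relies on the attribute value being followed by whitespace, as in the sample.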

Upvotes: 1
