user4275254

Reputation:

Get only URL from string - Python

I am scraping a page with Python and the BeautifulSoup library.

I need to extract only the URL from the string below. It is the value of the href attribute of an a tag. I have scraped the attribute, but I cannot find a way to pull the URL out of it:

javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');

Upvotes: 0

Views: 133

Answers (4)

Mithun

Reputation: 19

Here's a quick and ugly answer: split the href on single quotes and take the second piece.

href.split("'")[1]
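As a minimal sketch of how that one-liner behaves, assuming href holds the exact string from the question: splitting on single quotes puts the first quoted argument at index 1. The length check is an addition for illustration, since indexing would raise IndexError on an href with no quotes at all.

```python
href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"

# parts[0] is the text before the first quote; parts[1] is the first quoted argument.
parts = href.split("'")

# Guard (an assumption, not part of the original answer): require an opening
# and a closing quote, which yields at least three pieces.
url = parts[1] if len(parts) > 2 else None
print(url)
```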

Upvotes: -1

user4275254

Reputation:

I did it this way:

terms = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"

terms.split("('")[1].split("','")[0]

This outputs:

/Sheraton-Tucson-Hotel-177/tnc/150/24795/en

Upvotes: 1

Jon Clements

Reputation: 142216

Instead of a regex, you could just partition the string twice on a delimiter (e.g. '):

s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
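One property worth illustrating is that str.partition never raises: when the delimiter is missing it returns the original string plus two empty strings. The sketch below wraps the double partition in a helper; the name first_quoted is hypothetical, and the check for both quotes is a slightly stricter variant than the one-liner above, which would return the trailing text if only an opening quote were present.

```python
def first_quoted(s):
    """Return the text between the first pair of single quotes, or '' if there is none."""
    # First partition: drop everything up to and including the opening quote.
    _, opening, rest = s.partition("'")
    # Second partition: keep everything before the closing quote.
    value, closing, _ = rest.partition("'")
    # An empty separator means that quote was not found.
    return value if opening and closing else ""


s = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
print(first_quoted(s))
```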

Upvotes: 0

senshin

Reputation: 10360

You can write a straightforward regex to extract the URL.

>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'

The regex in question here is

'(.*?)'

This reads: "find a single quote, followed by anything (captured), followed by another single quote, matching non-greedily because of the ? after the *". With re.findall, this extracts every quoted argument of window.open; then, just pick the first one to get the URL.

You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
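A slightly stricter variant, offered only as a sketch: anchoring the pattern on window.open( so that unrelated quoted text elsewhere in the href is ignored, and using [^'] instead of a non-greedy match. The helper name popup_url and the assumption that every href follows the format in the question are mine, and the %27 path in the last line is a made-up example showing how urllib.parse.unquote decodes an escaped quote after extraction.

```python
import re
from urllib.parse import unquote

# Assumption: the href always has the form window.open('<url>', ...).
OPEN_CALL = re.compile(r"window\.open\('([^']*)'")


def popup_url(href):
    """Return the first argument of window.open(...) in href, or None if absent."""
    match = OPEN_CALL.search(href)
    return match.group(1) if match else None


href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
print(popup_url(href))

# Hypothetical path containing an escaped single quote (%27).
print(unquote("/O%27Hare-Hotel/tnc"))
```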

Upvotes: 2
