How to get href from <a> tag which contains JavaScript using Python?

Question

I am trying to get the href from an <a> tag using Python and Selenium, but the href contains JavaScript, so I am unable to get the target URL.

I am using Python 3.7.3, selenium 3.141.0.

HTML:

Aberdeen Standard Wholesale Australian Fixed Income

Code:

from selenium import webdriver
driver = webdriver.Chrome("chromedriver.exe")
driver.get("http://www.colonialfirststate.com.au/Price_performance/performanceNPrice.aspx?menutabtype=performance&CompanyCode=001&Public=1&MainGroup=IF&BrandName=FC&ProductIDs=91&Product=FirstChoice+Wholesale+Investments&ACCodes=&ACText=&SearchType=Performance&Multi=False&Hedge=False&IvstType=Investment+products&IvstGroup=&APIR=&FundIDs=&FundName=&FundNames=&SearchProdIDs=&Redirect=1")
# print the href attribute of the matched link
print(driver.find_element_by_xpath("tbody/tr[5]/td[1]/a").get_attribute("href"))

What I need is the target URL:

https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3

but it is giving me:

javascript:GoPDF('FS2311')

Kim · Accepted Answer

Kudos to the accepted answer for doing the background work.

I'd recommend using the urllib.parse facilities from the standard library. URLs are not as straightforward as they first appear, and the people who wrote urllib are experts on the URL standards (RFC 3986 and its predecessors such as RFC 1808).

Since you are web scraping, you will probably need to apply the same process to a variety of URLs later on, including those with different domain names, multi-digit query components (?1234, among many other possibilities) or even fragments (?1234#example etc.). The accepted answer will fail on all of these.
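
To make those components concrete, here is a small sketch of how urlparse splits the target URL from the question. The "#example" fragment is added purely for illustration and is not part of the real URL.

```python
from urllib.parse import urlparse

# The "#example" fragment is an illustrative assumption, not part of the real URL.
parts = urlparse(
    "https://www3.colonialfirststate.com.au"
    "/content/dam/prospects/fs/1/5/fs1546.pdf?3#example"
)

print(parts.scheme)    # https
print(parts.netloc)    # www3.colonialfirststate.com.au
print(parts.path)      # /content/dam/prospects/fs/1/5/fs1546.pdf
print(parts.query)     # 3
print(parts.fragment)  # example
```

Because the query and fragment come back as separate fields, code that only rewrites the path does not need to care how long the query is or whether a fragment is present.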

The following code looks more complicated at first sight, but it delegates the tricky (and potentially brittle) URL handling to urllib. It also uses more robust and flexible methods to extract the GoPDF fileId and the invariant part of the URL.

from urllib.parse import urlparse, urlunparse


def build_pdf_url(model_url, js_href):
    url = urlparse(model_url)
    pdf_fileid = get_fileid_from_js_href(js_href)
    pdf_path = build_pdf_path(model_url, pdf_fileid)
    return urlunparse((url.scheme, url.netloc, pdf_path, url.params,
                      url.query, url.fragment))


def get_fileid_from_js_href(href):
    """extract fileid by extracting text between single quotes"""
    return href.split("'")[1].lower()


def build_pdf_path(url, pdf_fileid):
    prefix = pdf_fileid[:2]
    major_version = pdf_fileid[2]
    minor_version = pdf_fileid[3]
    filename = pdf_fileid + '.pdf'
    return '/'.join([invariant_path(url), prefix, major_version, minor_version, filename])


def invariant_path(url, dropped_components=4):
    """
    return all but the dropped components of the URL 'path'
    NOTE: path components are separated by '/'
    """
    path_components = urlparse(url).path.split('/')
    return '/'.join(path_components[:-dropped_components])


js_href = "javascript:GoPDF('FS1546')"
model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(build_pdf_url(model_url, js_href))


$ python urlbuild.py
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3
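
One caveat: href.split("'")[1] raises an IndexError when an href contains no quoted argument, for example a plain https:// link. Below is a minimal sketch of a stricter extractor, assuming every PDF link has exactly the javascript:GoPDF('...') form seen in the question. The names extract_fileid and GOPDF_PATTERN are hypothetical and not part of the code above.

```python
import re

# Assumed link format, taken from the question: javascript:GoPDF('FS1546')
GOPDF_PATTERN = re.compile(r"^javascript:GoPDF\('([^']+)'\)$")


def extract_fileid(href):
    """Return the lower-cased file id, or None if href is not a GoPDF link."""
    match = GOPDF_PATTERN.match(href.strip())
    return match.group(1).lower() if match else None


print(extract_fileid("javascript:GoPDF('FS1546')"))  # fs1546
print(extract_fileid("https://example.com/page"))    # None
```

Returning None lets the scraping loop skip links that are not GoPDF calls instead of failing partway through a page.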
