m.gibin

Reputation: 137

How to get href from <a> tag which contains JavaScript using Python?

I am trying to get the href from an <a> tag using Python and Selenium, but the href holds a javascript: call rather than a real link, so I am unable to get the target URL.

I am using Python 3.7.3 and Selenium 3.141.0.

HTML:

<a href="javascript:GoPDF('FS1546')" style="TEXT-DECORATION: Underline">Aberdeen Standard Wholesale Australian Fixed Income</a>

Code:

from selenium import webdriver
driver = webdriver.Chrome("chromedriver.exe")
driver.get("http://www.colonialfirststate.com.au/Price_performance/performanceNPrice.aspx?menutabtype=performance&CompanyCode=001&Public=1&MainGroup=IF&BrandName=FC&ProductIDs=91&Product=FirstChoice+Wholesale+Investments&ACCodes=&ACText=&SearchType=Performance&Multi=False&Hedge=False&IvstType=Investment+products&IvstGroup=&APIR=&FundIDs=&FundName=&FundNames=&SearchProdIDs=&Redirect=1")
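# find_elements_by_xpath returns a list, so read each element's href attribute.
# The leading // lets the path match anywhere in the document.
for link in driver.find_elements_by_xpath("//tbody/tr[5]/td[1]/a"):
    print(link.get_attribute("href"))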

What I need is the target URL, for example:

https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3

but it is giving me something like:

javascript:GoPDF('FS2311')

Upvotes: 4

Views: 439

Answers (2)

Kim

Reputation: 1664

Kudos to the accepted answer for doing the background work.

I'd recommend using the urllib.parse facilities from the standard library. URLs are not as straightforward as they first appear, and the people who wrote urllib are experts on the URL standards (RFC 1808 and its successor, RFC 3986).

Since you are web scraping, further down the track you will probably need to apply the same process to a variety of URLs, including those with different domain names, multi-digit query components (?1234 and a whole other set of possibilities) or even fragments (?1234#example, etc.). The accepted answer hard-codes the domain and the ?3 query, so it will fail on all of these.

The following code looks more complicated at first sight, but it delegates the tricky (and potentially brittle) URL handling to urllib. It also uses more robust and flexible methods to extract the GoPDF file ID and the invariant part of the URL.

from urllib.parse import urlparse, urlunparse


def build_pdf_url(model_url, js_href):
    url = urlparse(model_url)
    pdf_fileid = get_fileid_from_js_href(js_href)
    pdf_path = build_pdf_path(model_url, pdf_fileid)
    return urlunparse((url.scheme, url.netloc, pdf_path, url.params,
                      url.query, url.fragment))


def get_fileid_from_js_href(href):
    """extract fileid by extracting text between single quotes"""
    return href.split("'")[1].lower()


def build_pdf_path(url, pdf_fileid):
    prefix = pdf_fileid[:2]
    major_version = pdf_fileid[2]
    minor_version = pdf_fileid[3]
    filename = pdf_fileid + '.pdf'
    return '/'.join([invariant_path(url), prefix, major_version, minor_version, filename])


def invariant_path(url, dropped_components=4):
    """
    return all but the dropped components of the URL 'path'
    NOTE: path components are separated by '/'
    """
    path_components = urlparse(url).path.split('/')
    return '/'.join(path_components[:-dropped_components])


js_href = "javascript:GoPDF('FS1546')"
model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(build_pdf_url(model_url, js_href))


$ python urlbuild.py
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3
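To illustrate the point about queries and fragments, here is a compact sketch of the same idea using the _replace and geturl methods of the result returned by urlparse. It is not the code above, just a shorter variant. The function name swap_pdf_path and the ?1234#example model URL are made up for the example, and it assumes the last four path components are the ones derived from the file ID.

```python
from urllib.parse import urlparse


def swap_pdf_path(model_url, file_id):
    """Rebuild model_url with the path components derived from file_id.

    Assumption: the last four path components (prefix, two single
    characters, file name) are the only parts that depend on the file ID.
    """
    parts = urlparse(model_url)
    file_id = file_id.lower()
    # Keep everything before the last four path components.
    base = parts.path.split("/")[:-4]
    new_path = "/".join(
        base + [file_id[:2], file_id[2], file_id[3], file_id + ".pdf"]
    )
    # Scheme, host, query and fragment are carried over untouched.
    return parts._replace(path=new_path).geturl()


# Hypothetical model URL with a multi-digit query and a fragment.
model_url = (
    "https://www3.colonialfirststate.com.au/content/dam/prospects"
    "/fs/2/3/fs2311.pdf?1234#example"
)
print(swap_pdf_path(model_url, "FS1546"))
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?1234#example
```

Because only the path is replaced, whatever query or fragment the model URL carries comes through unchanged.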

Upvotes: 1

CodeIt

Reputation: 3618

I checked the PDF URL from the popup and worked out how the site generates it.

It uses the file name (e.g. FS2065) to generate the PDF URL.

The URL of the PDF looks like this: https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/0/fs2065.pdf?3

For all PDFs, the path is the same up to this part:

https://www3.colonialfirststate.com.au/content/dam/prospects/

After that part, the rest of the path is generated from the file ID:

fs/2/0/fs2065.pdf?3
 | | |     |     ||
 | | |     |     ++--- Not needed (but you can keep it if you want)
 | | |     |
 | | |     +---- File Name
 | | +---------- 4th character in the file name 
 | +------------ 3rd character in the file name 
 +-------------- First two characters in the file name 

We can use this as a workaround to build the exact URL.

url = "javascript:GoPDF('FS2311')" # javascript URL  

pdfFileId = url[18:-2].lower() # extracts the file name from the Javascript URL

pdfBaseUrl = "https://www3.colonialfirststate.com.au/content/dam/prospects/%s/%s/%s/%s.pdf?3"%(pdfFileId[:2],pdfFileId[2],pdfFileId[3],pdfFileId) 

print(pdfBaseUrl)
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3

See it in action here.
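If the fixed slice url[18:-2] feels fragile, one alternative is to pull the argument out with a regular expression. This is only a sketch: the helper names extract_file_id and pdf_url are made up here, and it assumes every link has the form javascript:GoPDF('...') with a quoted argument of at least four characters.

```python
import re


def extract_file_id(js_href):
    """Return the lower-cased GoPDF argument, or None if the href does not match."""
    match = re.search(r"GoPDF\(\s*['\"]([^'\"]+)['\"]\s*\)", js_href)
    return match.group(1).lower() if match else None


def pdf_url(file_id):
    """Build the PDF URL using the path layout described above."""
    return (
        "https://www3.colonialfirststate.com.au/content/dam/prospects/%s/%s/%s/%s.pdf?3"
        % (file_id[:2], file_id[2], file_id[3], file_id)
    )


file_id = extract_file_id("javascript:GoPDF('FS2311')")
print(pdf_url(file_id))
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3
```

Returning None for hrefs that do not match makes it easy to skip unrelated links when looping over every <a> element on the page.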

Upvotes: 3
