Reputation: 137
I am trying to get the href from an a tag using Python + Selenium, but the href contains a "javascript:" call instead of a link, so I am unable to get the target URL.
I am using Python 3.7.3 and selenium 3.141.0.
HTML:
<a href="javascript:GoPDF('FS1546')" style="TEXT-DECORATION: Underline">Aberdeen Standard Wholesale Australian Fixed Income</a>
Code:
from selenium import webdriver
driver = webdriver.Chrome("chromedriver.exe")
driver.get("http://www.colonialfirststate.com.au/Price_performance/performanceNPrice.aspx?menutabtype=performance&CompanyCode=001&Public=1&MainGroup=IF&BrandName=FC&ProductIDs=91&Product=FirstChoice+Wholesale+Investments&ACCodes=&ACText=&SearchType=Performance&Multi=False&Hedge=False&IvstType=Investment+products&IvstGroup=&APIR=&FundIDs=&FundName=&FundNames=&SearchProdIDs=&Redirect=1")
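# get_attribute("href") returns the raw attribute value of the matched link
print(driver.find_element_by_xpath("//tbody/tr[5]/td[1]/a").get_attribute("href"))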
What I need is the target URL, such as:
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3
but it's giving me:
javascript:GoPDF('FS2311')
Upvotes: 4
Views: 439
Reputation: 1664
Kudos to the accepted answer for doing the background work.
I'd recommend using the urllib.parse facilities from the standard library. URLs are not as straightforward as they first appear, and the people who wrote urllib are experts on the URL standards (RFC 3986 and its predecessors such as RFC 1808).
Since you are web scraping, further down the track you will probably need to apply the same process to a variety of URLs, including those with different domain names, multi-digit query components (?1234 and a whole other set of possibilities), or even fragments (?1234#example etc.). The accepted answer will fail on all of these.
The following code looks more complicated at first sight but delegates the tricky (and potentially brittle) URL handling to urllib. It also uses more robust and flexible methods to extract the GoPDF fileId and the invariant part of the URL.
from urllib.parse import urlparse, urlunparse


def build_pdf_url(model_url, js_href):
    url = urlparse(model_url)
    pdf_fileid = get_fileid_from_js_href(js_href)
    pdf_path = build_pdf_path(model_url, pdf_fileid)
    return urlunparse((url.scheme, url.netloc, pdf_path, url.params,
                       url.query, url.fragment))


def get_fileid_from_js_href(href):
    """extract fileid by extracting text between single quotes"""
    return href.split("'")[1].lower()


def build_pdf_path(url, pdf_fileid):
    prefix = pdf_fileid[:2]
    major_version = pdf_fileid[2]
    minor_version = pdf_fileid[3]
    filename = pdf_fileid + '.pdf'
    return '/'.join([invariant_path(url), prefix, major_version, minor_version, filename])


def invariant_path(url, dropped_components=4):
    """
    return all but the dropped components of the URL 'path'
    NOTE: path components are separated by '/'
    """
    path_components = urlparse(url).path.split('/')
    return '/'.join(path_components[:-dropped_components])


js_href = "javascript:GoPDF('FS1546')"
model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(build_pdf_url(model_url, js_href))
$ python urlbuild.py
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3
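To illustrate why delegating to urllib.parse helps, here is a minimal sketch of how urlparse splits the kinds of URLs mentioned above. Only the first URL comes from the site; the ?1234 and ?1234#example variants are made up for illustration, and split_url is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Only the first URL is taken from the site; the other two are
# hypothetical variants with a multi-digit query and a fragment.
urls = [
    "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3",
    "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?1234",
    "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?1234#example",
]


def split_url(url):
    """Return the (path, query, fragment) components of a URL."""
    parts = urlparse(url)
    return parts.path, parts.query, parts.fragment


for u in urls:
    print(split_url(u))
```

In each case the path component is the same, so code that rebuilds only the path and passes the query and fragment through urlunparse unchanged does not need to know how long the query is or whether a fragment is present.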
Upvotes: 1
Reputation: 3618
I checked the PDF URL from the popup and found out how they are generating it.
They use the file name (e.g. FS2065) to generate the PDF URL.
The URL of the PDF looks like this: https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/0/fs2065.pdf?3
All PDFs share the same path up to this part:
https://www3.colonialfirststate.com.au/content/dam/prospects/
After that part, the path is generated from the file ID:
fs/2/0/fs2065.pdf?3
|  | | |         ||
|  | | |         ++--- Not needed (but you can keep it if you want)
|  | | |
|  | | +-------------- File name
|  | +---------------- 4th character in the file name
|  +------------------ 3rd character in the file name
+--------------------- First two characters in the file name
We can use this as a workaround to get the exact url.
url = "javascript:GoPDF('FS2311')" # javascript URL
pdfFileId = url[18:-2].lower() # extracts the file name from the Javascript URL
pdfBaseUrl = "https://www3.colonialfirststate.com.au/content/dam/prospects/%s/%s/%s/%s.pdf?3"%(pdfFileId[:2],pdfFileId[2],pdfFileId[3],pdfFileId)
print(pdfBaseUrl)
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3
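The slice url[18:-2] relies on the exact length of the javascript:GoPDF(' prefix. A slightly more defensive sketch could use a regular expression instead. It assumes that every link has the form javascript:GoPDF('<id>') and that the path scheme described above holds for all file IDs; BASE, GOPDF_RE and pdf_url_from_href are names I made up for illustration.

```python
import re

# Assumption: the base path and the trailing "?3" are taken from the
# example URL above and are the same for every PDF.
BASE = "https://www3.colonialfirststate.com.au/content/dam/prospects"

# Matches e.g. javascript:GoPDF('FS2311') and captures the file ID.
GOPDF_RE = re.compile(r"javascript:GoPDF\('([A-Za-z0-9]+)'\)")


def pdf_url_from_href(href):
    """Build the PDF URL from a javascript:GoPDF(...) href, or return None."""
    match = GOPDF_RE.search(href)
    if match is None:
        return None  # not a GoPDF link
    file_id = match.group(1).lower()
    if len(file_id) < 4:
        return None  # too short to derive the three directory levels
    return "%s/%s/%s/%s/%s.pdf?3" % (
        BASE, file_id[:2], file_id[2], file_id[3], file_id)


print(pdf_url_from_href("javascript:GoPDF('FS2311')"))
```

With Selenium, the same function can be applied to the value returned by get_attribute("href") for each link element; hrefs that are not GoPDF calls come back as None and can be skipped.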
Upvotes: 3