Python beautifulsoup not able to extract a hyperlink from href tag

Question

I am trying to scrape data from a website. The page links to an Excel sheet through the href attribute of an a tag. I have tried several approaches with requests and BeautifulSoup, but I cannot get the link to the Excel sheet.

Website URL: https://ppac.gov.in/prices/international-prices-of-crude-oil

The item I want to scrape is the Excel file linked on that page. I located the link by inspecting the element in the browser.

I have tried the code below, but every time I run it I get all the links except this xlsx file.

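import re
import urllib.request  # a bare "import urllib" does not load the request submodule

from bs4 import BeautifulSoup

url = "https://ppac.gov.in/prices/international-prices-of-crude-oil"
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute https link present in the static HTML
links = []
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
    links.append(link.get('href'))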

The output contains all the links except the Excel file URL mentioned above. Can anyone help me get this URL? It changes daily, so I need to scrape it with a regex that matches on http or xlsx. (I also tried for link in soup.find_all(attrs={'href': re.compile("xlsx")}), with the same result.)

Expected output is the URL of the Excel file: https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude%20Oil%20FOB%20Price%20%28Indian%20Basket%29.xlsx

HedgeHog · Accepted Answer

The data comes from an XHR request and is rendered dynamically by the browser, so the link is not in the HTML that urllib or requests receives. Check your browser's dev tools to find the request and its payload data. The best approach is to send the same request yourself and get the data as JSON.
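To see why a parser cannot find a link that the browser adds later, here is a minimal sketch that collects .xlsx hrefs from a piece of HTML. It uses the standard library's html.parser so it runs without extra packages. The two HTML strings and the example.com URLs are made-up stand-ins, not the real page markup.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def xlsx_links_in_html(html):
    """Return the hrefs in the given HTML that end with .xlsx."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if h.lower().endswith(".xlsx")]


# Hypothetical static markup: the container the browser fills later is still empty.
static_html = '<div id="report"></div><a href="https://example.com/about">About</a>'
# Hypothetical markup after JavaScript has run in the browser.
rendered_html = '<div id="report"><a href="https://example.com/report.xlsx">Report</a></div>'

print(xlsx_links_in_html(static_html))    # []
print(xlsx_links_in_html(rendered_html))  # ['https://example.com/report.xlsx']
```

An HTTP client only ever sees the first form of the page, which would explain why the regex filters in the question never match the Excel link.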

Example

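import requests

# Endpoint the page calls via XHR (visible in the browser's dev tools)
url = 'https://ppac.gov.in/AjaxController/getInternationalPricesCrudeOil'

# Send the same form payload as the browser and read the file URL from the JSON
response = requests.post(
    url,
    data={
        'financialYear': '2022-2023',
        'reportBy': 4,
        'pageId': 30
    })
print(response.json()['result']['1']['file_name'])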

Output

https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude%20Oil%20FOB%20Price%20%28Indian%20Basket%29.xlsx
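Since the question says the file URL changes daily, hardcoding the key '1' may be fragile if the response layout shifts. As a hedged alternative, the sketch below walks whatever JSON comes back and collects every string ending in .xlsx. The function name find_xlsx_urls and the sample payload are hypothetical; the sample only mirrors the result / file_name keys used above and is not the real response.

```python
def find_xlsx_urls(payload):
    """Recursively collect every string ending in .xlsx from nested JSON data."""
    found = []
    if isinstance(payload, dict):
        for value in payload.values():
            found.extend(find_xlsx_urls(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(find_xlsx_urls(item))
    elif isinstance(payload, str) and payload.lower().endswith(".xlsx"):
        found.append(payload)
    return found


# Hypothetical payload; the real response may contain other keys and entries.
sample = {
    "result": {
        "1": {"file_name": "https://example.com/uploads/reports/a.xlsx"},
        "2": {"file_name": "https://example.com/uploads/reports/b.pdf"},
    }
}
print(find_xlsx_urls(sample))  # ['https://example.com/uploads/reports/a.xlsx']
```

With the request above, this would be called as find_xlsx_urls(response.json()), assuming the endpoint keeps returning JSON.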
