Hunaidkhan

Reputation: 1418

Python beautifulsoup not able to extract a hyperlink from href tag

I am trying to scrape data from a website. The page links to an Excel sheet through the href attribute of an a tag. I have tried several approaches with requests and BeautifulSoup, but I cannot get the link to the Excel sheet.

Website URL: https://ppac.gov.in/prices/international-prices-of-crude-oil

The item I want to scrape is the Excel download link on that page. [screenshot of the link]

Inspecting the element shows the a tag whose href holds the Excel file. [screenshot of the inspected element]

I have tried the code below, but every time I run it I get all the links except this xlsx file.

import re
import urllib.request

from bs4 import BeautifulSoup

url = "https://ppac.gov.in/prices/international-prices-of-crude-oil"
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute https:// link found in the downloaded HTML
links = []
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
    links.append(link.get('href'))

The output contains all the links except the Excel file URL mentioned above. Can anyone help me get this URL? It changes daily, so I need to scrape it by matching on https or xlsx. (I also tried for link in soup.find_all(attrs={'href': re.compile("xlsx")}), with the same result.)

Expected output is the URL of the Excel file: https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude%20Oil%20FOB%20Price%20%28Indian%20Basket%29.xlsx

Upvotes: 0

Views: 97

Answers (2)

HedgeHog

Reputation: 25048

The data comes from an XHR request and is rendered dynamically by the browser, so the link is not in the HTML that urllib or requests downloads. Check your browser's dev tools to find the request and its payload data. The best approach is to send the same request yourself and read the data as JSON.

Example

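import requests

# Endpoint the page calls via XHR (visible in the browser's dev tools)
url = 'https://ppac.gov.in/AjaxController/getInternationalPricesCrudeOil'

# Send the same form payload the browser sends
response = requests.post(
    url,
    data={
        'financialYear': '2022-2023',
        'reportBy': 4,
        'pageId': 30
    })

# The Excel URL is stored under result -> '1' -> file_name
print(response.json()['result']['1']['file_name'])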

Output

https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude%20Oil%20FOB%20Price%20%28Indian%20Basket%29.xlsx
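Since the URL changes daily, hard-coding 'financialYear' and the key '1' may stop working at some point. Below is a rough sketch of two helpers that might make the request above a bit more robust. Both names (financial_year, xlsx_urls) are hypothetical, not part of any library. The first assumes the site uses the Indian financial year (1 April to 31 March). The second assumes the JSON keeps the shape seen above, a 'result' mapping whose entries carry a 'file_name', which I have only checked for the key '1'.

```python
from datetime import date


def financial_year(today: date) -> str:
    """Return a label such as '2022-2023' for the given date.

    Assumption: the financial year runs from 1 April to 31 March.
    """
    start = today.year if today.month >= 4 else today.year - 1
    return f"{start}-{start + 1}"


def xlsx_urls(payload: dict) -> list:
    """Collect every 'file_name' under payload['result'] that ends in .xlsx.

    Assumption: 'result' is a dict (or list) of dict entries, as in the
    response shown above; entries of any other shape are skipped.
    """
    result = payload.get("result", {})
    entries = result.values() if isinstance(result, dict) else result
    urls = []
    for entry in entries:
        name = entry.get("file_name", "") if isinstance(entry, dict) else ""
        if isinstance(name, str) and name.lower().endswith(".xlsx"):
            urls.append(name)
    return urls
```

To use it, pass financial_year(date.today()) as the 'financialYear' value in the POST data, then call xlsx_urls() on the parsed .json() response instead of indexing ['result']['1'] directly.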

Upvotes: 1

Matheus Irineu

Reputation: 51

Does the button's position change? If not, I'd use Selenium with an XPath to click the button or retrieve its URL. Something like driver.find_element(By.XPATH, "<the button's XPath>").click() should start the download.

Upvotes: 0
