Jkiefn1

Reputation: 143

Downloading data from Google Drives via Python Requests

I am having trouble downloading .pdf files that a website links to and that are hosted on Google Drive.

The website can be found here.

Source code shows that the links I'm after are easily-recognized...

<ul style="margin-left: 40px;">
<li><a href="https://drive.google.com/open?id=11Zw72KDm4cdfswuCjbeM2c3sM6kdcowE" target="_blank">January 4, 6-9, 2020</a></li>
<li><a href="https://drive.google.com/a/agfc.ar.gov/file/d/1OtSZtBxaNWGqlDvBp-cG7TAwOHjYacm_/view?usp=sharing" target="_blank">December 12-20, 2019</a></li>
<li><a href="https://drive.google.com/open?id=1HPa1REOTy_Kz9wxLUpT4N57KEurE8Z9f" target="_blank">November 16-19, 2019</a></li>
<li><a href="https://drive.google.com/open?id=1iCBknPwIxirmWeiD7VPKxwCYvgQUkOB-" target="_blank">January 20-23, 2019</a></li>

...where everything between a href=" and " target="_blank" is the hyperlink I'm after.

I've tried to go about it using requests.get()...

import re

import requests
from bs4 import BeautifulSoup

site = 'site goes here'

# Fetch the page and find the <ul> blocks that hold the report links
url_locs = BeautifulSoup(requests.get(site).text.lower(), 'html.parser').findAll('ul', {'style': 'margin-left: 40px;'})

# Locate the url for the pdf
report_urls = re.findall('<li><a href="(.*?)" target="', str(url_locs))
#print (report_urls)

# Download and save the individual pdfs, then record the filepath to add to the INDEX
for url in report_urls:
    r = requests.get(url)
    print(r)

... but the output is <Response [404]> for all.

After digging through the APIs and reading answers to similar earlier questions, like this one and this one, I can tell there is a step I'm missing, or maybe the whole approach is off, but I am not sure where to go from here.

The google drive is accessible to anyone who goes to the site, so I wouldn't know what the authentication information would be, nor is there any mention of a "driver".

Simply copying and pasting the links from my script's output into my browser also returns a 404 error, so I imagine I'm pretty far off in my approach.

Any and all help is warmly appreciated.

Upvotes: 0

Views: 483

Answers (1)

Iamblichus

Reputation: 19309

Issue:

You are converting all the content retrieved from the site to lowercase. Drive links are built from the corresponding file IDs, which are case-sensitive, so the lowercased links you are requesting no longer point to any existing file. Hence, you get a 404.
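To see the effect in isolation, here is a minimal sketch that takes one of the links quoted in the question and applies the same .lower() call to it. It makes no network request; it only shows that the file ID in the resulting URL is no longer the one the page published.

```python
# One of the links from the question, exactly as it appears in the page source.
original = "https://drive.google.com/open?id=11Zw72KDm4cdfswuCjbeM2c3sM6kdcowE"

# What the question's code ends up requesting after .lower() is applied to the HTML.
lowered = original.lower()

# The two URLs differ, because the uppercase letters in the file ID were changed.
print(original == lowered)  # False
print(lowered)
```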

Solution:

When making the get request to the site, don't set the response to lowercase. Change this:

requests.get(site).text.lower()

To this:

requests.get(site).text
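As a rough illustration of what the extraction looks like once the case is preserved, the sketch below parses two of the list items quoted in the question and pulls out the links and their file IDs. It is only a sketch under some assumptions: it uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra packages or network access, SAMPLE_HTML stands in for requests.get(site).text, and LinkCollector and drive_file_id are hypothetical helper names, not part of any library. The two URL shapes it handles are the ones visible in the question.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Stand-in for requests.get(site).text (note: no .lower()).
SAMPLE_HTML = """
<ul style="margin-left: 40px;">
<li><a href="https://drive.google.com/open?id=11Zw72KDm4cdfswuCjbeM2c3sM6kdcowE" target="_blank">January 4, 6-9, 2020</a></li>
<li><a href="https://drive.google.com/a/agfc.ar.gov/file/d/1OtSZtBxaNWGqlDvBp-cG7TAwOHjYacm_/view?usp=sharing" target="_blank">December 12-20, 2019</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def drive_file_id(url):
    """Hypothetical helper: return the file ID from either link shape, or None."""
    parsed = urlparse(url)
    # Shape 1: https://drive.google.com/open?id=<ID>
    query_id = parse_qs(parsed.query).get("id")
    if query_id:
        return query_id[0]
    # Shape 2: https://drive.google.com/.../file/d/<ID>/view
    match = re.search(r"/file/d/([^/]+)", parsed.path)
    return match.group(1) if match else None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
report_urls = collector.links
file_ids = [drive_file_id(url) for url in report_urls]

for url, file_id in zip(report_urls, file_ids):
    print(file_id, url)
```

The parser lowercases tag and attribute names but leaves attribute values untouched, so the IDs come out with their original case.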

Upvotes: 1
