Kevin

Reputation: 421

Python Scraping PDFs From a Website: Why Are They All Corrupt and the Same Size?

Hopefully this will be an easy one. I am trying to do some web scraping where I download all the PDF files from a page. Currently I am scraping files from a sports page for practice. I used Automate the Boring Stuff plus a post from another user (retrieve links from web page using python and BeautifulSoup) to come up with this code.

import os
import requests
from bs4 import BeautifulSoup, SoupStrainer

r = requests.get('http://secsports.go.com/media/baseball')

# Parse only the <a> tags on the page
soup = BeautifulSoup(r.text, 'html.parser', parse_only=SoupStrainer('a'))

for link in soup:
    if link.has_attr('href') and 'pdf' in str(link):
        image_file = open(os.path.join('E:\\thisiswhereiwantmypdfstogo', os.path.basename(link['href'])), 'wb')
        for chunk in r.iter_content(100000):
            image_file.write(chunk)
        image_file.close()

The files are all written to the directory I specify, which is great, but they are all the same size, and when I open one in Adobe Acrobat Pro I get an error that says:

"Adobe Acrobat could not open "FILENAMEHERE" because it is either not a supported filetype or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."

A little hint that clued me in to something going wrong with the write process was that image_file.write(chunk) returned the same number for every file.
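For reference, file.write() returns the number of bytes written, so getting the same number back for every file means every file is receiving an identical payload. A minimal illustration (demo.bin is just a throwaway filename):

# write() returns the number of bytes written
with open('demo.bin', 'wb') as f:
    print(f.write(b'%PDF-fake-bytes'))  # prints 15, the length of the payload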

Here is what the PDFs look like in the folder:

[screenshot: the_corrupted_pdfs]

I am thinking I just need to add a parameter somewhere in the writing process for it to work correctly, but I have no idea what it would be. I searched Google and also this site but could not find an answer.

Thanks!

Upvotes: 1

Views: 898

Answers (1)

Kevin

Reputation: 421

Hmmm. After doing some more research, it seems like I figured out the problem. My original loop was writing r, the response for the listing page itself, into every file, so each "PDF" was really just a copy of that page's HTML. I modified my code so that each link['href'] is requested as its own response object, then wrote those responses to my directory, and it worked.
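Here is a minimal sketch of that change, assuming the hrefs on this page are absolute URLs (relative links would need urllib.parse.urljoin) and keeping the same download directory as in the question:

import os
import requests
from bs4 import BeautifulSoup, SoupStrainer

r = requests.get('http://secsports.go.com/media/baseball')

for link in BeautifulSoup(r.text, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and 'pdf' in str(link):
        # Request the PDF itself; before, r (the listing page) was being written out
        pdf_response = requests.get(link['href'])
        out_path = os.path.join('E:\\thisiswhereiwantmypdfstogo', os.path.basename(link['href']))
        with open(out_path, 'wb') as pdf_file:
            for chunk in pdf_response.iter_content(100000):
                pdf_file.write(chunk)

Since each call to requests.get fetches that link's own bytes, every file gets its real content and size instead of a copy of the listing page.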

Upvotes: 1
