Reputation: 421
Hopefully this will be an easy one. I am trying to do some web scraping where I download all the PDF files from a page. For practice I am currently scraping files from a sports page. I used Automate the Boring Stuff plus a post from another user (retrieve links from web page using python and BeautifulSoup) to come up with this code:
import os
import requests
import time
from bs4 import BeautifulSoup, SoupStrainer

r = requests.get('http://secsports.go.com/media/baseball')
soup = BeautifulSoup(r.content)
for link in BeautifulSoup(r.text, parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        if 'pdf' in str(link):
            image_file = open(os.path.join('E:\\thisiswhereiwantmypdfstogo', os.path.basename(link['href'])), 'wb')
            for chunk in r.iter_content(100000):
                image_file.write(chunk)
            image_file.close()
The files are all written to the directory I specify, which is great, but the file size is the same for every one of them, and when I open one in Adobe Pro to look at it I get an error that says:
"Adobe Acrobat could not open "FILENAMEHERE" because it is either not a supported filetype or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."
A little hint that clued me in to something going wrong with the write process was that image_file.write(chunk) returned the same byte counts for every file.
I am thinking I just need to add a parameter somewhere during the write process for it to work correctly, but I have no idea what it would be. I did some Google searching and also searched a bit on here, but cannot find the answer.
Thanks!
Upvotes: 1
Views: 898
Reputation: 421
Hmmm. After doing some more research it seems I figured out the problem. I do not understand exactly why this works, but I'll take a stab at it: I modified my code so that each link['href'] is fetched as its own response object, and then I wrote those responses to my directory. That worked.
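Roughly, the change looks like this (a sketch of the idea rather than my exact code; it assumes the hrefs on the page are absolute URLs, otherwise you would need to join them against the page URL first):

import os
import requests
from bs4 import BeautifulSoup, SoupStrainer

save_dir = 'E:\\thisiswhereiwantmypdfstogo'
page = requests.get('http://secsports.go.com/media/baseball')

for link in BeautifulSoup(page.text, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if 'pdf' in str(link):
            # make a separate request for each pdf link instead of
            # re-using the response object for the page itself
            pdf_response = requests.get(link['href'])
            out_path = os.path.join(save_dir, os.path.basename(link['href']))
            with open(out_path, 'wb') as pdf_file:
                for chunk in pdf_response.iter_content(100000):
                    pdf_file.write(chunk)

The key difference from my original code is that the chunks being written now come from the response for each individual PDF, not from the response for the page I scraped the links off of, which is why all the files used to come out the same size.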
Upvotes: 1