x89

Reputation: 3490

Using BeautifulSoup to Download Links From a Web Page

I wrote a function to find all .pdf files on a web page and download them. It works well when the link is publicly accessible, but when I use it on a course website (which can only be accessed from my university's network), the PDFs it downloads are corrupted and cannot be opened.

How can I fix it?

import urllib2                      # Python 2; in Python 3 use urllib.request
import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            links.append(my_url + current_link)
    print(links)

    for link in links:
        #urlretrieve(link)
        wget.download(link)

get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')

When I use this grader link, current_link is something like /courses/320241/2019_2/lectures/lecture_7_8.pdf, but the /courses/320241/2019_2/ part is already included in my_url, so when I append it the result obviously doesn't work. However, the function works perfectly for [this link][1].

Is there a way I can use the same function to work with both types of links?
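To make the difference concrete, here is a small sketch (not part of my function) using urljoin from the standard library, which resolves both forms of href against the page URL:

from urllib.parse import urljoin   # Python 3; in Python 2 it lives in urlparse

base = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'

# Relative href: appended inside the base path
print(urljoin(base, 'lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

# Root-relative href (starts with '/'): resolved against the host, so the path is not duplicated
print(urljoin(base, '/courses/320241/2019_2/lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf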

Upvotes: 1

Views: 1879

Answers (1)

Jack Fleeting

Reputation: 24940

OK, I think I understand the issue now. Try the code below on your data. I think it works, but obviously I couldn't try it directly on the page that requires a login. I also changed your structure and variable definitions a bit, because I find it easier to think that way; if it works, you can easily modify it to suit your own tastes.

Anyway, here goes:

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse

my_urls = ['https://cnds.jacobs-university.de/courses/os-2019/', 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2']
links = []
for url in my_urls:
    resp = requests.get(url)
    soup = bs(resp.text, 'lxml')
    og = soup.find("meta", property="og:url")   # some pages expose their canonical base URL here
    base = urlparse(url)
    for link in soup.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            if og:
                # build the absolute URL from the og:url base
                links.append(og["content"] + current_link)
            else:
                # fall back to scheme + host; the href already starts with the full path
                links.append(base.scheme + "://" + base.netloc + current_link)
for link in links:
    print(link)
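
If you then want to actually save the files, here is a sketch of one way to do it with requests, using the links list built above (the session/login part for the grader site is just an assumption on my part; you would have to fill in whatever authentication that site uses):

import os
import requests
from urllib.parse import urlparse

session = requests.Session()
# For the login-protected grader pages you would authenticate the session first,
# e.g. via session.post(...) or session.auth = (...); the details depend on the site.

for link in links:
    resp = session.get(link)
    resp.raise_for_status()              # fail loudly instead of silently saving an error page
    filename = os.path.basename(urlparse(link).path)
    with open(filename, 'wb') as f:
        f.write(resp.content)            # write the raw PDF bytes to disk

That is also the likely reason your original downloads were corrupted: without a logged-in session the server returns an HTML login page instead of the PDF, and wget saves that HTML under a .pdf name.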

Upvotes: 1
