Reputation: 3470
In my Python function, I pass in a URL, search for PDF files on that page, and then download those files. In most cases it works perfectly.
import urllib2
import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            links.append(my_url + current_link)
    #print(links)
    for link in links:
        #urlretrieve(link)
        wget.download(link)

get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')
However, when I try using my function on one particular course website, current_link is
/courses/320241/2019_2/lectures/lecture_7_8.pdf
although it should only be
lectures/lecture_7_8.pdf
while the original my_url that I passed to the function was
https://grader.eecs.jacobs-university.de/courses/320241/2019_2/
Since I'm concatenating the two and part of the path is repeated, the downloaded files are corrupted. How can I check whether any part of current_link is already contained in my_url, and if so, how can I remove it before downloading?
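Concretely, this is roughly the URL the current code ends up requesting for that href (a small illustration using the values above):

my_url = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'
current_link = '/courses/320241/2019_2/lectures/lecture_7_8.pdf'
# Naive concatenation repeats the /courses/320241/2019_2/ segment:
print(my_url + current_link)
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2//courses/320241/2019_2/lectures/lecture_7_8.pdf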
Upvotes: 1
Views: 73
Reputation: 12255
Update: using urljoin from urllib.parse will do the job:
from urllib.parse import urljoin

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            # urljoin resolves both relative and root-relative hrefs against my_url
            links.append(urljoin(my_url, current_link))
    #print(links)
    for link in links:
        #urlretrieve(link)
        wget.download(link)
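To see why this fixes the duplicated path, here is a quick check with the URLs from the question (a minimal sketch; urljoin resolves both root-relative and relative hrefs against the base URL):

from urllib.parse import urljoin

base = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'

# Root-relative href (the case from the question) is resolved against the host,
# so the course path is not duplicated:
print(urljoin(base, '/courses/320241/2019_2/lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

# A plain relative href is resolved against the base directory and gives the same result:
print(urljoin(base, 'lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf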
Simplified method: .select('a[href$=pdf]') selects all links whose href ends with pdf:
from urllib.parse import urljoin

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    # select() with the "ends with" attribute selector picks out the pdf links directly
    [wget.download(urljoin(my_url, link.get('href'))) for link in html_page.select('a[href$=pdf]')]
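a[href$=pdf] is the CSS "ends with" attribute selector, which BeautifulSoup's select() supports, so it replaces the explicit loop and endswith() check. Note that the list comprehension is used only for its side effect (downloading); a plain for loop would do the same job.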
Upvotes: 1