x89

Reputation: 3470

Finding a Substring in a Link

In my Python function, I pass in a URL, search for PDF files on that page and then download those files. In most cases it works perfectly.

import urllib2
import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            # build the download URL by appending the href to the page URL
            links.append(my_url + current_link)
    #print(links)

    for link in links:
        #urlretrieve(link)
        wget.download(link)


get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')

However, when I try using my function on a particular course website, my current_link is

/courses/320241/2019_2/lectures/lecture_7_8.pdf

though it should be detected automatically and should be only

lectures/lecture_7_8.pdf

while the original my_url that I passed to the function was

https://grader.eecs.jacobs-university.de/courses/320241/2019_2/

Since I'm concatenating the two and part of the path is repeated, the downloaded files are corrupted. How can I check whether any part of current_link is already contained in my_url and, if so, remove it before downloading?
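
For clarity, here is a minimal sketch of what the concatenation produces with the values above:

my_url = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'
current_link = '/courses/320241/2019_2/lectures/lecture_7_8.pdf'

# The /courses/320241/2019_2/ part appears twice in the result,
# so the request does not point at the actual file:
print(my_url + current_link)
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2//courses/320241/2019_2/lectures/lecture_7_8.pdf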

Upvotes: 1

Views: 73

Answers (1)

Sers

Reputation: 12255

Update: using urljoin from urllib.parse will do the job:

from urllib.parse import urljoin

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            links.append(urljoin(my_url, current_link))
    #print(links)

    for link in links:
        #urlretrieve(link)
        wget.download(link)
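
For reference, urljoin resolves both server-absolute and page-relative hrefs against the page URL, so either form of link from the question ends up at the same address. A small standalone check using the URLs from the question:

from urllib.parse import urljoin

base = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'

# href given as an absolute path (the problematic case)
print(urljoin(base, '/courses/320241/2019_2/lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

# href given relative to the page
print(urljoin(base, 'lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf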

Simplified method: .select('a[href$=pdf]') selects all links whose href ends with pdf:

from urllib.parse import urljoin

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    [wget.download(urljoin(my_url, link.get('href'))) for link in html_page.select('a[href$=pdf]')]
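
To see what that selector matches, here is a small self-contained check; the HTML fragment below is made up for illustration:

from bs4 import BeautifulSoup

# Made-up HTML fragment just to show which anchors a[href$=pdf] matches
html = '''
<a href="lectures/lecture_7_8.pdf">slides</a>
<a href="syllabus.html">syllabus</a>
<a href="/courses/320241/2019_2/sheet_1.pdf">exercise sheet</a>
'''
page = BeautifulSoup(html, 'html.parser')
print([a.get('href') for a in page.select('a[href$=pdf]')])
# ['lectures/lecture_7_8.pdf', '/courses/320241/2019_2/sheet_1.pdf']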

Upvotes: 1
