SIM

Reputation: 22440

Unable to download files from a certain website

I've written some code in Python to scrape file links from a webpage, but I have no idea how to actually download the files those links point to. If someone could help me achieve that, I would be very grateful. Thanks a lot in advance.

Link to that site: web_link

Here is my try:

from bs4 import BeautifulSoup
import requests

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    print(item['href'])

Upon execution, the above script prints four different URLs pointing to those files.

Upvotes: 0

Views: 77

Answers (2)

etaloof

Reputation: 662

You can use requests.get to fetch each file and write its content to disk:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/"
                        "viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    # Use the last path segment of the link as the local file name.
    filename = item['href'].split('/')[-1]
    # Fetch the file and write its raw bytes to disk.
    with open(filename, 'wb') as f:
        f.write(requests.get(item['href']).content)

Upvotes: 2

alecxe

Reputation: 473863

You could go with the standard library's urllib.request.urlretrieve(), but, since you are already using requests, you can re-use the session here (download_file is largely taken from this answer):

from bs4 import BeautifulSoup
import requests


def download_file(session, url):
    local_filename = url.split('/')[-1]

    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

    return local_filename


with requests.Session() as session:
    response = session.get("http://usda.mannlib.cornell.edu/MannUsda/"
                           "viewDocumentInfo.do?documentID=1194")
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("#latest a"):
        local_filename = download_file(session, item['href'])
        print(f"Downloaded {local_filename}")
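For completeness, the urllib.request.urlretrieve() route mentioned above can be sketched without requests at all. This is a minimal illustration (the fetch helper name is mine, not part of either answer):

```python
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def fetch(url):
    # Derive a local file name from the last path segment of the URL.
    local_filename = urlsplit(url).path.split('/')[-1]
    # urlretrieve downloads the resource straight to disk,
    # so the whole file is never held in memory at once.
    urlretrieve(url, local_filename)
    return local_filename
```

Note that urlretrieve is documented as a legacy interface, so the session-based approach above is generally preferable when requests is already a dependency.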

Upvotes: 1
