Reputation: 22440
I've written some code in Python to download files from a webpage. Since I had no idea how to download files from a site, I could only scrape the file links from it. If someone could help me achieve the downloading part, I would be very grateful. Thanks a lot in advance.
Link to that site: web_link
Here is my attempt:
from bs4 import BeautifulSoup
import requests

# Fetch the page and print every file link under the "#latest" container
response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    print(item['href'])
Upon execution, the above script prints four different URLs pointing to those files.
Upvotes: 0
Views: 77
Reputation: 662
You can use requests.get to fetch each link and write the response content to a local file:
import requests
from bs4 import BeautifulSoup

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/"
                        "viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    # Derive a local filename from the last path segment of the URL
    filename = item['href'].split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(item['href']).content)
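One caveat: this assumes the scraped href values are absolute URLs, which they are on this page according to the question. If a page returned relative links instead, urllib.parse.urljoin could resolve them against the page URL first; a minimal sketch under that assumption:

from urllib.parse import urljoin

for item in soup.select("#latest a"):
    file_url = urljoin(response.url, item['href'])  # resolves relative hrefs; leaves absolute URLs unchanged
    filename = file_url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(file_url).content)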
Upvotes: 2
Reputation: 473863
You can go with the standard library's urllib.request.urlretrieve(), but since you are already using requests, you can reuse the session here (download_file was largely taken from this answer; a urlretrieve sketch is shown after the code below):
from bs4 import BeautifulSoup
import requests

def download_file(session, url):
    local_filename = url.split('/')[-1]
    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename

with requests.Session() as session:
    response = session.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("#latest a"):
        local_filename = download_file(session, item['href'])
        print(f"Downloaded {local_filename}")
Upvotes: 1