Reputation: 84465
Problem:
Don't know if my Google-fu is failing me again, but I am unable to download CSVs from a list of URLs. I have used requests and bs4 to gather the URLs (the final list is correct) - see the process outline below for more info. I then followed one of the answers given here using urllib to download: Trying to download data from URL with CSV File, as well as a number of other Stack Overflow Python answers on downloading CSVs.
Currently I am stuck with an
HTTP Error 404: Not Found
(the stack trace below is from my last attempt, where I pass a User-Agent):
----> 9 f = urllib.request.urlopen(req)
10 print(f.read().decode('utf-8'))
#other lines
--> 650 raise HTTPError(req.full_url, code, msg, hdrs, fp)
651
652 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 404: Not Found
I tried the solution here of adding a User-Agent: Web Scraping using Python giving HTTP Error 404: Not Found, though I would have expected a 403 rather than a 404 error code - it seems to have worked for a number of OPs.
This still failed with the same error. I am pretty sure I could solve this by simply using selenium and passing the CSV URLs to .get, but I want to know if I can solve this with requests alone.
Outline:
I visit this page: https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice
I grab all the monthly version links, e.g. Patients Registered at a GP Practice May 2019, then visit each of those pages and grab all the CSV links within. Finally, I loop over the resulting dictionary of filename:download_url pairs, attempting to download the files.
Question:
Can anyone see what I am doing wrong or how to fix this so I can download the files without resorting to selenium? I'm also unsure of the most efficient way to accomplish this - perhaps urllib is not actually required at all and just requests will suffice?
Python:
Without user-agent:
import requests
from bs4 import BeautifulSoup as bs
import urllib.request

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]: base + item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links) #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()} #flatten list of dicts to single dict

path = r'C:\Users\User\Desktop'

for k, v in all_files.items():
    #print(k,v)
    print(v)
    response = urllib.request.urlopen(v)
    html = response.read()
    with open(path + '\\' + k + '.csv', 'wb') as f:
        f.write(html)
    break #as only need one test case
Test with adding User-Agent:
req = urllib.request.Request(
    v,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
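For completeness, this is how I understand the same header would be passed using requests alone (an untested sketch, with v being one of the download URLs from the loop above):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}

# v is one of the download URLs from the all_files dictionary above
r = requests.get(v, headers=headers)
r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
with open(path + '\\' + k + '.csv', 'wb') as f:
    f.write(r.content)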
Upvotes: 1
Views: 83
Reputation: 28565
Looking at the values, your links are coming out like this:
https://digital.nhs.uk/https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv
The hrefs on the monthly pages are already absolute URLs, so prepending base produces an invalid address, hence the 404. I think you want to drop the base +, so use this:
file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}
instead of:
file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
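If you ever hit a page that mixes relative and absolute hrefs, urllib.parse.urljoin would handle both cases - a small sketch, not something these particular pages actually need:
from urllib.parse import urljoin

base = 'https://digital.nhs.uk/'

# urljoin leaves absolute hrefs untouched and resolves relative ones against base
file_links = {item.text.strip().split('\n')[0]: urljoin(base, item['href'])
              for item in soup.select('[href$=".csv"]')}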
Edit: Full Code:
import requests
from bs4 import BeautifulSoup as bs

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]: item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links) #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()} #flatten list of dicts to single dict

path = 'C:/Users/User/Desktop/'

for k, v in all_files.items():
    #print(k,v)
    print(v)
    response = requests.get(v)
    html = response.content
    k = k.replace(':', ' -')
    file = path + k + '.csv'
    with open(file, 'wb') as f:
        f.write(html)
    break #as only need one test case
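If any of the CSVs are large, you could also stream the download instead of holding the whole response in memory - an optional sketch along the same lines, using one of the files found above:
import requests

csv_url = 'https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv'

# stream=True fetches the body in chunks rather than all at once
with requests.get(csv_url, stream=True) as r:
    r.raise_for_status()
    with open('gp-reg-patients-04-2014-lsoa.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)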
Upvotes: 1