QHarr

Reputation: 84465

Download csvs to desktop from csv links

Problem:

I don't know if my Google-fu is failing me again, but I am unable to download CSVs from a list of URLs. I have used requests and bs4 to gather the URLs (the final list is correct) - see the process below for more info.

I then followed one of the answers given here, using urllib to download: Trying to download data from URL with CSV File, as well as a number of other Stack Overflow Python answers for downloading CSVs.

Currently I am stuck with an

HTTP Error 404: Not Found

(the stack trace below is from the last attempt, where a User-Agent was passed)

----> 9 f = urllib.request.urlopen(req)
     10 print(f.read().decode('utf-8'))
     #other lines

--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

I tried the solution here of adding a User-Agent: Web Scraping using Python giving HTTP Error 404: Not Found. I would have expected a 403 rather than a 404 error code, but adding a User-Agent seems to have worked for a number of OPs.

This still failed with the same error. I am pretty sure I could solve this by simply using Selenium and passing the CSV URLs to .get, but I want to know if I can solve this with requests alone.


Outline:

I visit this page:

https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice

I grab all the monthly version links, e.g. Patients Registered at a GP Practice May 2019, then visit each of those pages and grab all the CSV links within.

I loop over the final dictionary of filename:download_url pairs, attempting to download the files.


Question:

Can anyone see what I am doing wrong, or how to fix this so I can download the files without resorting to Selenium? I'm also unsure of the most efficient way to accomplish this - perhaps urllib is not actually required at all and requests alone will suffice?


Python:

Without a User-Agent:

import requests
from bs4 import BeautifulSoup as bs
import urllib

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  #flatten list of dicts to single dict


path = r'C:\Users\User\Desktop'

for k,v in all_files.items():
    #print(k,v)
    print(v)
    response = urllib.request.urlopen(v)
    html = response.read()

    with open(path + '\\' + k + '.csv', 'wb') as f:
        f.write(html)
    break  #as only need one test case

Test with adding User-Agent:

req = urllib.request.Request(
    v, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
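
For reference, the same test can be made with requests alone by passing the headers dict to requests.get - roughly like this (it is the same request as above, just issued through requests, with v taken from the loop):

import requests

# Rough sketch: the User-Agent test above, but with requests instead of urllib.
# v is the download URL taken from the all_files loop.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
response = requests.get(v, headers=headers)
response.raise_for_status()   # raises requests.HTTPError on a 4xx/5xx status
print(response.text[:500])    # preview the start of the response body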

Upvotes: 1

Views: 83

Answers (1)

chitown88

Reputation: 28565

Looking at the values, this is what it's showing me for your links:

https://digital.nhs.uk/https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv

I think you want to drop the base +, so use this:

file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}

instead of:

file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
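
If some pages ever mix relative and absolute hrefs, urllib.parse.urljoin is a safe way to build the URL either way: it leaves an absolute href untouched and only prepends the base for a relative one. A rough sketch of that variant (the relative path below is illustrative):

from urllib.parse import urljoin

base = 'https://digital.nhs.uk/'

# An already-absolute href comes back unchanged...
urljoin(base, 'https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv')
# -> 'https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv'

# ...while a relative href gets the base prepended.
urljoin(base, 'data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
# -> 'https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice'

file_links = {item.text.strip().split('\n')[0]: urljoin(base, item['href'])
              for item in soup.select('[href$=".csv"]')}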

Edit: Full Code:

import requests
from bs4 import BeautifulSoup as bs

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  #flatten list of dicts to single dict

path = 'C:/Users/User/Desktop/'

for k,v in all_files.items():
    #print(k,v)
    print(v)
    response = requests.get(v)
    html = response.content

    k = k.replace(':', ' -')  # ':' is not allowed in Windows file names
    file = path + k + '.csv'

    with open(file, 'wb' ) as f:
        f.write(html)
    break  #as only need one test case
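
Once the single test case works and you remove the break, you may also want to stream each download to disk instead of holding the whole CSV in memory, and strip any other characters Windows disallows in file names. A rough sketch along those lines, reusing the all_files dict built above:

import os
import re
import requests

path = 'C:/Users/User/Desktop'

for name, url in all_files.items():
    # Replace characters that are not allowed in Windows file names.
    safe_name = re.sub(r'[<>:"/\\|?*]', ' -', name)
    target = os.path.join(path, safe_name + '.csv')

    # stream=True writes the file in chunks rather than loading it all at once.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(target, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)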

Upvotes: 1
