nevster
nevster

Reputation: 379

How to download an xlsx file from a website to save as a dataframe with Pandas

How do I download a file:

COVID-19 Data to be able to save one of its sheets named Covid-19 - Weekly occurrences as a dataframe.

The url works if I put it in a browser.

I have tried:

import requests
import io
import pandas as pd    

url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'

s=requests.get(url).content
df_deathsAges = pd.read_excel(io.StringIO(s.decode('utf-8')), 
                          nrows = 25, header = 5, sheet_name='Covid-19 - Weekly occurrences')

but I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 15: invalid start byte

I have tried:

url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'

df_deathsAges = pd.read_excel(url,'Covid-19 - Weekly occurrences')

But I get the error:

HTTPError: HTTP Error 403: Forbidden

What is the best way to proceed to accomplish this task?

Upvotes: 5

Views: 5754

Answers (3)

Ferris
Ferris

Reputation: 5601

use pd.ExcelFile to parse multi-sheet xlsx files.

url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'
s = requests.get(url).content

xl = pd.ExcelFile(s)  # io.BytesIO(s) is also work
xl.sheet_names        # see all sheet names
# xl.parse(sheet_name)  # read a specific sheet to DataFrame

['Analysis',
 'Contents',
 'Information',
 'Terms and conditions',
 'Weekly figures 2020',
 'Covid-19 - Weekly registrations',
 'Covid-19 - Weekly occurrences',
 'UK - Covid-19 - Weekly reg',
 'Covid-19 - Daily registrations',
 'Covid-19 - Daily occurrences',
 'Covid-19 - Place of occurrence ',
 'Estimated total deaths 2020',
 'Related publications']

df_deathsAges = xl.parse('Covid-19 - Weekly occurrences', nrows = 25, header = 5) 
df_deathsAges = df_deathsAges.dropna(how='all').dropna(how='all', axis=1)

Upvotes: 4

Liam Clegg
Liam Clegg

Reputation: 11

pd.read_excel can take bytes as input directly. This works for me:

df_deathsAges = pd.read_excel(s, 
                          nrows = 25, header = 5, sheet_name='Covid-19 - Weekly occurrences')

Upvotes: 0

foglerit
foglerit

Reputation: 8269

xlsx is a binary format and it's not valid UTF-8. Try to load it into pandas as a binary file stream:

import requests
import io
import pandas as pd    

url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'

s=requests.get(url).content
df_deathsAges = pd.read_excel(io.BytesIO(s),
                          nrows = 25, header = 5, sheet_name='Covid-19 - Weekly occurrences', engine="openpyxl")

Note: I tested the code and was not able to read the xlsx file with the default engine xlrd, but was successful with openpyxl

Upvotes: 5

Related Questions