Reputation: 379
How do I download a file:
COVID-19 Data to be able to save one of its sheets named Covid-19 - Weekly occurrences
as a dataframe.
The url works if I put it in a browser.
I have tried:
import requests
import io
import pandas as pd
url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'
s=requests.get(url).content
df_deathsAges = pd.read_excel(io.StringIO(s.decode('utf-8')),
nrows = 25, header = 5, sheet_name='Covid-19 - Weekly occurrences')
but I get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 15: invalid start byte
I have tried:
url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'
df_deathsAges = pd.read_excel(url,'Covid-19 - Weekly occurrences')
But I get the error:
HTTPError: HTTP Error 403: Forbidden
What is the best way to proceed to accomplish this task?
Upvotes: 5
Views: 5754
Reputation: 5601
use pd.ExcelFile
to parse multi-sheet xlsx files.
url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'
s = requests.get(url).content
xl = pd.ExcelFile(s) # io.BytesIO(s) is also work
xl.sheet_names # see all sheet names
# xl.parse(sheet_name) # read a specific sheet to DataFrame
['Analysis',
'Contents',
'Information',
'Terms and conditions',
'Weekly figures 2020',
'Covid-19 - Weekly registrations',
'Covid-19 - Weekly occurrences',
'UK - Covid-19 - Weekly reg',
'Covid-19 - Daily registrations',
'Covid-19 - Daily occurrences',
'Covid-19 - Place of occurrence ',
'Estimated total deaths 2020',
'Related publications']
df_deathsAges = xl.parse('Covid-19 - Weekly occurrences', nrows = 25, header = 5)
df_deathsAges = df_deathsAges.dropna(how='all').dropna(how='all', axis=1)
Upvotes: 4
Reputation: 11
pd.read_excel can take bytes as input directly. This works for me:
df_deathsAges = pd.read_excel(s,
nrows = 25, header = 5, sheet_name='Covid-19 - Weekly occurrences')
Upvotes: 0
Reputation: 8269
xlsx
is a binary format and it's not valid UTF-8. Try to load it into pandas as a binary file stream:
import requests
import io
import pandas as pd
url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2020/referencetablescorrected.xlsx'
s=requests.get(url).content
df_deathsAges = pd.read_excel(io.BytesIO(s),
nrows = 25, header = 5, sheet_name='Covid-19 - Weekly occurrences', engine="openpyxl")
Note: I tested the code and was not able to read the xlsx
file with the default engine xlrd
, but was successful with openpyxl
Upvotes: 5