Problem loading a compressed (.gz) .csv file from url

Question

I am trying to load a CSV file directly from a URL with pandas. The file is gzip-compressed (.gz):

#Importing libraries
import pandas as pd
import requests
import io

#defining the url
url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',', compression='gzip', index_col=0, quotechar='"')

And here is the error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
 in 
      7 url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
      8 s=requests.get(url).content
----> 9 df=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',', compression='gzip', index_col=0, quotechar='"')
     10 
     11 #df=pd.read_csv("caso_full.csv.gz")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

If I download the file directly there is no error when I open it:

#Importing libraries
import pandas as pd
df=pd.read_csv("caso_full.csv.gz")

Any tips on why this is happening?

Thank you!

dmmfll · Accepted Answer

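The issue is that you are decoding the content as UTF-8 and wrapping it in io.StringIO while it is still gzip-compressed. The byte 0x8b in the error message is the second byte of the gzip magic number (0x1f 0x8b), which is not valid UTF-8.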

The solution is to not decode the bytes and use io.BytesIO.

See this Stack Overflow answer: https://stackoverflow.com/a/38131261/1913726

The URL returns the content as gzip-compressed bytes. pd.read_csv expects a file path or buffer as its first argument. Because the content is bytes, an io.BytesIO object must be used. With compression="gzip", pandas then decompresses the data and parses the resulting CSV.
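
As a rough illustration of why the decode step fails, the error can be reproduced without any network access. This is only a sketch: the `payload` below is a made-up, in-memory stand-in for the downloaded bytes, not the real file.

```python
import gzip

# Hypothetical stand-in for response.content: a tiny gzip-compressed CSV.
payload = gzip.compress("city,state\nSão Paulo,SP\n".encode("utf-8"))

# A gzip stream starts with the magic bytes 0x1f 0x8b.
magic = payload[:2]
print(magic.hex())

# Decoding the compressed bytes as UTF-8 fails on the second byte (0x8b).
try:
    payload.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)
```

The failure happens at position 1, the same position reported in the traceback in the question.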

import io
import pandas as pd
import requests

# defining the url
url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
response = requests.get(url)
content = response.content
print(type(content))
df = pd.read_csv(
    io.BytesIO(content), sep=",", compression="gzip", index_col=0, quotechar='"',
)
print(df.head())

OUTPUT:

city_ibge_code        date  epidemiological_week  estimated_population_2019  ...  place_type  state  new_confirmed  new_deaths
city                                                                                    ...
São Paulo       3550308.0  2020-02-25                     9                 12252023.0  ...        city     SP              1           0
NaN                  35.0  2020-02-25                     9                 45919049.0  ...       state     SP              1           0
São Paulo       3550308.0  2020-02-26                     9                 12252023.0  ...        city     SP              0           0
NaN                  35.0  2020-02-26                     9                 45919049.0  ...       state     SP              0           0
São Paulo       3550308.0  2020-02-27                     9                 12252023.0  ...        city     SP              0           0

[5 rows x 16 columns]
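
The same fix can be checked offline with a small in-memory example. This is a minimal sketch under assumptions: `csv_text` is invented sample data and `payload` plays the role of `response.content`. It also shows a second option, decompressing with the standard library first, after which decoding and io.StringIO become appropriate.

```python
import gzip
import io

import pandas as pd

# Hypothetical sample data standing in for the real download.
csv_text = "city,state,new_confirmed\nSão Paulo,SP,1\nCampinas,SP,0\n"
payload = gzip.compress(csv_text.encode("utf-8"))  # plays the role of response.content

# Option 1: hand pandas the raw bytes and let it decompress them.
df_bytes = pd.read_csv(io.BytesIO(payload), compression="gzip", index_col=0)

# Option 2: decompress first, then decode; only now is StringIO the right wrapper.
text = gzip.decompress(payload).decode("utf-8")
df_text = pd.read_csv(io.StringIO(text), index_col=0)

# Both routes should produce the same DataFrame.
print(df_bytes.equals(df_text))
```

Option 1 avoids holding a second, decompressed copy of the data as a Python string, which may matter for a file as large as the one in the question.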
