Leonardo Araujo

Reputation: 105

Problem loading a compressed (.gz) .csv file from url

I am trying to load a CSV file into pandas directly from a URL. The file is gzip-compressed (.gz):

#Importing libraries
import pandas as pd
import requests
import io

#defining the url
url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',', compression='gzip', index_col=0, quotechar='"')

And here is the error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-28-58ebbb6aba80> in <module>
      7 url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
      8 s=requests.get(url).content
----> 9 df=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',', compression='gzip', index_col=0, quotechar='"')
     10 
     11 #df=pd.read_csv("caso_full.csv.gz")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

If I download the file first and open the local copy, there is no error:

#Importing libraries
import pandas as pd
df=pd.read_csv("caso_full.csv.gz")

Any tips on why this is happening?

Thank you!

Upvotes: 2

Views: 2658

Answers (1)

dmmfll

Reputation: 2836

The issue is that you decode the response content as UTF-8 and wrap it in io.StringIO, but the content is still gzip-compressed binary data, not text.

The solution is to leave the bytes undecoded and wrap them in io.BytesIO instead.

See this Stack Overflow answer: https://stackoverflow.com/a/38131261/1913726

The URL returns gzip-compressed content. A gzip stream starts with the bytes 0x1f 0x8b, which is why the decode fails on byte 0x8b at position 1. pd.read_csv expects a file path or a file-like buffer as its first argument. Because the content is bytes, an io.BytesIO object must be used. pandas then decompresses the data and parses the resulting CSV.
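The cause of the UnicodeDecodeError can be reproduced without downloading anything. The sketch below uses only the standard library; the small CSV string is a made-up stand-in for the real file, not data from the URL.

```python
import gzip

# Hypothetical stand-in for the downloaded bytes: a tiny gzip-compressed CSV.
csv_text = "city,state,new_confirmed\nSão Paulo,SP,1\n"
payload = gzip.compress(csv_text.encode("utf-8"))

# A gzip stream begins with the two magic bytes 0x1f 0x8b.
magic = payload[:2]

# Decoding the compressed bytes as UTF-8 fails at position 1:
# 0x1f is a valid ASCII control character, but 0x8b cannot start a UTF-8 sequence.
try:
    payload.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start
    print(exc)

# Decompressing first and decoding afterwards recovers the original text.
recovered = gzip.decompress(payload).decode("utf-8")
```

The point is the order of operations: the bytes have to be decompressed before they can be treated as text, so decoding them straight away cannot work.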

import io
import pandas as pd
import requests

# defining the url
url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
response = requests.get(url)
content = response.content
print(type(content))
df = pd.read_csv(
    io.BytesIO(content), sep=",", compression="gzip", index_col=0, quotechar='"',
)
print(df.head())

OUTPUT:

<class 'bytes'>
city_ibge_code        date  epidemiological_week  estimated_population_2019  ...  place_type  state  new_confirmed  new_deaths
city                                                                                    ...
São Paulo       3550308.0  2020-02-25                     9                 12252023.0  ...        city     SP              1           0
NaN                  35.0  2020-02-25                     9                 45919049.0  ...       state     SP              1           0
São Paulo       3550308.0  2020-02-26                     9                 12252023.0  ...        city     SP              0           0
NaN                  35.0  2020-02-26                     9                 45919049.0  ...       state     SP              0           0
São Paulo       3550308.0  2020-02-27                     9                 12252023.0  ...        city     SP              0           0

[5 rows x 16 columns]
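The same fix can be tried offline on a small in-memory example. This is a minimal sketch, assuming a made-up two-row CSV in place of the real download; only the io.BytesIO and compression="gzip" parts mirror the code above.

```python
import gzip
import io

import pandas as pd

# Hypothetical sample data standing in for response.content.
csv_text = "city,state,new_confirmed\nSão Paulo,SP,1\nRio de Janeiro,RJ,0\n"
content = gzip.compress(csv_text.encode("utf-8"))

# Pass the raw bytes through io.BytesIO and let pandas decompress them.
df = pd.read_csv(io.BytesIO(content), compression="gzip", index_col=0)
print(df)
```

Because the buffer has no file name, compression="gzip" is stated explicitly here rather than relying on pandas inferring it from a .gz extension.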

Upvotes: 5
