Reputation: 105
I am trying to load a CSV file into pandas directly from a URL. The file is gzip-compressed (.gz):
#Importing libraries
import pandas as pd
import requests
import io
#defining the url
url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',', compression='gzip', index_col=0, quotechar='"')
And here is the error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-28-58ebbb6aba80> in <module>
7 url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
8 s=requests.get(url).content
----> 9 df=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',', compression='gzip', index_col=0, quotechar='"')
10
11 #df=pd.read_csv("caso_full.csv.gz")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
If I download the file first and open it locally, there is no error:
#Importing libraries
import pandas as pd
df=pd.read_csv("caso_full.csv.gz")
Any tips on why this is happening?
Thank you!
Upvotes: 2
Views: 2658
Reputation: 2836
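The issue is that you are decoding the content as UTF-8 and then wrapping it in io.StringIO. At that point the content is still gzip-compressed binary data, not text: byte 0x8b in the error message is the second byte of the gzip magic number (0x1f 0x8b).
The solution is to leave the bytes undecoded and wrap them in io.BytesIO instead.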
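To see why the decode step fails, here is a minimal sketch that needs no network access. It assumes a small made-up CSV compressed in memory as a stand-in for response.content; the names csv_text, payload and error_message are illustrative and not part of the original code.

```python
import gzip

# Stand-in for response.content: a tiny CSV compressed in memory.
# The sample row is invented for illustration only.
csv_text = "city,state,new_confirmed\nSão Paulo,SP,1\n"
payload = gzip.compress(csv_text.encode("utf-8"))

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
print(payload[:2].hex())

# Decoding the still-compressed bytes as UTF-8 fails at position 1,
# because 0x8b is not a valid UTF-8 start byte.
error_message = None
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```

The message printed here should match the one in the question, which suggests the failure comes from the compressed payload itself rather than from anything specific to that dataset.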
See this Stack Overflow answer: https://stackoverflow.com/a/38131261/1913726
The URL returns gzip-compressed content. pd.read_csv expects a file path or buffer as its first argument. Because the content is bytes, an io.BytesIO object must be used. Pandas then decompresses the data and parses the resulting CSV.
import io
import pandas as pd
import requests
# defining the url
url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
response = requests.get(url)
content = response.content
print(type(content))
df = pd.read_csv(
    io.BytesIO(content), sep=",", compression="gzip", index_col=0, quotechar='"',
)
print(df.head())
OUTPUT:
<class 'bytes'>
           city_ibge_code        date  epidemiological_week  estimated_population_2019  ...  place_type  state  new_confirmed  new_deaths
city                                                                                    ...
São Paulo       3550308.0  2020-02-25                     9                 12252023.0  ...        city     SP              1           0
NaN                  35.0  2020-02-25                     9                 45919049.0  ...       state     SP              1           0
São Paulo       3550308.0  2020-02-26                     9                 12252023.0  ...        city     SP              0           0
NaN                  35.0  2020-02-26                     9                 45919049.0  ...       state     SP              0           0
São Paulo       3550308.0  2020-02-27                     9                 12252023.0  ...        city     SP              0           0
[5 rows x 16 columns]
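For comparison, here is a hedged sketch of the two orderings that do work, again using a small in-memory CSV as a stand-in for response.content (the sample rows and the names df_bytes and df_text are made up for illustration). It shows that io.StringIO is fine too, provided the bytes are decompressed before they are decoded.

```python
import gzip
import io

import pandas as pd

# Stand-in for response.content; the rows are invented for illustration.
csv_text = "city,state,new_confirmed\nSão Paulo,SP,1\nCampinas,SP,0\n"
content = gzip.compress(csv_text.encode("utf-8"))

# Option 1: hand pandas the raw bytes and let it decompress them.
df_bytes = pd.read_csv(io.BytesIO(content), compression="gzip", index_col=0)

# Option 2: decompress first, then decode; only now is the data text,
# so io.StringIO is appropriate and no compression argument is needed.
text = gzip.decompress(content).decode("utf-8")
df_text = pd.read_csv(io.StringIO(text), index_col=0)

# Both routes should produce the same DataFrame.
print(df_bytes.equals(df_text))
```

Option 1 is usually the simpler choice. As far as I know, pandas can also read directly from a URL with pd.read_csv(url) and infer gzip compression from the .gz extension, which would make the requests step unnecessary, but that was not tested against this particular server.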
Upvotes: 5