Guilherme Rodrigues

Reputation: 11

Trouble cleaning a dataframe, values dtypes are object

import pandas as pd

url = 'https://www2.bmf.com.br/pages/portal/bmfbovespa/boletim1/SistemaPregao_excel1.asp?Data=&Mercadoria=DI1'
df_list = pd.read_html(url)
data_raw = df_list[6].copy().drop([0])
vencto_col = data_raw[0]
ajuste_col = data_raw[13]
ajuste_col.info()
ajuste_col

If we run this in a Jupyter notebook, the output is:

<class 'pandas.core.series.Series'>
RangeIndex: 40 entries, 1 to 40
Series name: 13
Non-Null Count  Dtype 
--------------  ----- 
39 non-null     object
dtypes: object(1)
memory usage: 452.0+ bytes
1        AJUSTE
2     99.544,07
3     98.486,64
4     97.485,03
5     96.492,84
6     95.411,60
7     94.337,31
8     93.469,97
9     92.381,59
10    91.537,57
11    90.516,53
12    89.588,95

So info() tells me that these values are objects, but when I print the column they look like numbers in a dataframe. What am I missing here, and how can I get numbers (float64) and a real dataframe?

Upvotes: 0

Views: 36

Answers (1)

fsimonjetz

Reputation: 5802

object is just a generic dtype, often indicating that the column contains strings (rather than something more specific like int, float or datetime). Your values use '.' as the thousands separator and ',' as the decimal separator, so pandas keeps them as text by default. You need to set the thousands and decimal parameters when calling read_html so pandas can correctly parse the data, i.e.,

df_list = pd.read_html(url, thousands='.', decimal=',')
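If the column still comes back as object (for example because a header string such as AJUSTE sits in the same column as the numbers), the strings can also be converted after the fact. The following is only a rough sketch: the sample values are copied from the question's output rather than fetched from the URL, and the names ajuste_num and ajuste_df are made up for illustration.

```python
import pandas as pd

# Sample values copied from the question's output (assumption: the real
# column from read_html has the same shape: a header string, numbers
# formatted like "99.544,07", and possibly missing entries).
ajuste_col = pd.Series(["AJUSTE", "99.544,07", "98.486,64", None], name=13)

# Drop the thousands separator, then turn the decimal comma into a dot.
cleaned = (
    ajuste_col.str.replace(".", "", regex=False)
              .str.replace(",", ".", regex=False)
)

# errors="coerce" turns anything non-numeric (the "AJUSTE" header) into NaN,
# which is then dropped together with the missing entry.
ajuste_num = pd.to_numeric(cleaned, errors="coerce").dropna()

# A Series is a single column; to_frame() wraps it in a one-column DataFrame.
ajuste_df = ajuste_num.to_frame(name="AJUSTE")

print(ajuste_num.dtype)
```

The printed dtype should now be float64 instead of object. Whether dropping the non-numeric rows is appropriate depends on what else the real table contains.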

Upvotes: 1
