Reputation: 983
I am using two ways to fill a multi-index pandas DataFrame:
one that initializes it with values,
one that fills it row by row (speed does not matter here; I have to do it this way because each new row becomes available after a separate computation, not all at once).
Both DataFrames look the same, but pandas tells me they are not. What should I do so that they are equal?
import attr
import pandas as pd
@attr.s(frozen=True)
class CDE(object):
    name : str = attr.ib()
    def __str__(self):
        return self.name
# Method 1: filling it row per row.
rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['CDE','period'])
c_array = [['Column', 'Column'],['First', 'Last']]
cmidx = pd.MultiIndex.from_arrays(c_array)
summary1 = pd.DataFrame(index = rmidx, columns=cmidx)
mcde1 = CDE(name='hi')
mcde2 = CDE(name='ho')
values = [[20,30],[40,50]]
period = '5m'
summary1.loc[(mcde1, period),('Column',)]=values[0]
summary1.loc[(mcde2, period),('Column',)]=values[1]
# Method 2: all at once.
columns = summary1.columns
rows = summary1.index.names
index_label = [[mcde1,mcde2],[period,period]]
summary2 = pd.DataFrame(values, index=index_label, columns=columns)
summary2.index.names = rows
In [2]: summary1
Out[2]:
           Column
            First Last
CDE period
hi  5m         20   30
ho  5m         40   50
In [3]: summary2
Out[3]:
           Column
            First Last
CDE period
hi  5m         20   30
ho  5m         40   50
But
summary1.equals(summary2)
False
Why is that so? Thanks for any advice on that.
Upvotes: 0
Views: 267
Reputation: 150785
From the docs for DataFrame.equals: "This function requires that the elements have the same dtype as their respective elements in the other Series or DataFrame."
But if you do
summary1.dtypes, summary2.dtypes
You would get
(Column  First    object
         Last     object
 dtype: object,
 Column  First    int64
         Last     int64
 dtype: object)
This is because when you create an empty data frame
summary1 = pd.DataFrame(index = rmidx, columns=cmidx)
the default dtype is object. Therefore, whenever you append a new row, the data is cast to that dtype. On the other hand, if you create a data frame from given data, pandas will try to infer the best dtype, in this case int64.
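To make the two frames compare equal, the object columns can be converted before calling equals. The following is a minimal sketch rather than the asker's exact setup: it uses plain strings instead of CDE instances, and it builds the object-dtype frame directly with dtype=object as a stand-in for the row-by-row frame. The names as_object and as_int are made up for this illustration.

```python
import pandas as pd

# Same column and row layout as in the question, with plain strings as labels.
columns = pd.MultiIndex.from_arrays([["Column", "Column"], ["First", "Last"]])
index = pd.MultiIndex.from_arrays(
    [["hi", "ho"], ["5m", "5m"]], names=["CDE", "period"]
)
values = [[20, 30], [40, 50]]

# Stand-in for the frame filled row by row: same values, stored as object.
as_object = pd.DataFrame(values, index=index, columns=columns, dtype=object)
# Frame built all at once: pandas infers int64 for the columns.
as_int = pd.DataFrame(values, index=index, columns=columns)

# Same values, different dtypes, so equals() reports a mismatch.
different_dtypes_equal = as_object.equals(as_int)

# Option 1: cast explicitly once all rows are in place.
cast_equal = as_object.astype("int64").equals(as_int)

# Option 2: let pandas re-infer a better dtype for the object columns.
inferred_equal = as_object.infer_objects().equals(as_int)

print(different_dtypes_equal, cast_equal, inferred_equal)
```

In the row-by-row case, either conversion would be applied to summary1 after the last row has been added, for example summary1 = summary1.astype('int64'), assuming every value really is an integer.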
Upvotes: 3