Reputation: 4102
I have multiple old tables that I need to compare to new tables, and I want to use pandas to compare them.
The data looks something like this:
data_old = [
["John", 1,2,3,None],
["Mary", 1,4,None,"Apples"],
["Jean", None,4,-4,"Peaches"],
]
columns = ['name', "A", "B", "C", "D"]
data_new = [
["Jean", 5,4,-4,"Peaches"],
["John", 1,-2,3,None],
["Mary", 1,4,None,"Apples"],
]
Here we have two relatively small datasets, and the values can be text, numeric, or NULL. I wanted to create sub-sets dataframe with the changes and export report dataframes to a csv.
My issue is this, when I sort both dataframes by 'name', I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
Here is what I am doing:
df_old = pd.DataFrame(data=data_old, columns=columns)
df_old.sort(columns='name', inplace=True)
df_new = pd.DataFrame(data=data_new, columns=columns)
df_new.sort(columns='name', inplace=True)
ne = (df_old != df_new).any(1) #ERROR
# to other stuff.....
Upvotes: 4
Views: 1474
Reputation: 863166
I think you need sort_values
instead sort
, because:
FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
and then need same index values in both DataFrames
what return reset_index
with parameter drop=True
:
df_old = pd.DataFrame(data=data_old, columns=columns)
df_old.sort_values('name', inplace=True)
df_new = pd.DataFrame(data=data_new, columns=columns)
df_new.sort_values('name', inplace=True)
print (df_old)
name A B C D
2 Jean NaN 4 -4.0 Peaches
0 John 1.0 2 3.0 None
1 Mary 1.0 4 NaN Apples
print (df_new)
name A B C D
0 Jean 5 4 -4.0 Peaches
1 John 1 -2 3.0 None
2 Mary 1 4 NaN Apples
df_old.reset_index(drop=True, inplace=True)
df_new.reset_index(drop=True, inplace=True)
ne = (df_old != df_new).any(1)
print (ne)
0 True
1 True
2 True
dtype: bool
Upvotes: 4