Reputation: 649
I use combine_first
to combine two DataFrames based on two keys, the goal is to append the indices of df2 that are not contained in df1 to the result and to overwrite indices that are contained in both with the values from df2.
Example df1:
df1 = pd.DataFrame({
"key1": ["A", "A", "A", "B", "B", "C", "C"],
"id": ["a1", "a2", "a3", 1, 2, "c1", "c2"],
"data1": [pd.np.random.randint(5) for i in range(7)],
"data2": [pd.np.random.randint(1000) for i in range(7)]
})
Example df2:
df2 = pd.DataFrame({
"key1": ["B", "B", "B"],
"id": [2, 3, 4],
"data1": [pd.np.random.randint(5) for i in range(3)],
"data2": [pd.np.random.randint(1000) for i in range(3)]
})
df1.set_index(["key1", "id"]).combine_first(df2.set_index(["key1", "id"]))
gives the desired result:
data1 data2
key1 id
A a1 0.0 588.0
a2 2.0 709.0
a3 3.0 877.0
B 1 3.0 468.0
2 0.0 612.0
3 2.0 139.0
4 3.0 154.0
C c1 4.0 855.0
c2 4.0 564.0
However, after storing the result as csv, loading it again and running the same command I get the following error:
TypeError: '<' not supported between instances of 'str' and 'int'
But only for a df2 where id
is int
only. When id
contains characters it works without a problem.
I found this in the Pandas docs:
The different indexing operation can potentially change the dtype of a Series.
Which explains why the problem persists if I change the dtype of df2.id
to object
before setting the index and combining. How can I specifically set the dtype of a MultiIndex level so that the combination works?
EDIT To further illustrate the problem:
df = df1.set_index(["key1", "id"]).combine_first(df2.set_index(["key1", "id"]))
df.to_csv("tests/combtest2.csv", sep=";")
df_loaded = pd.read_csv("tests/combtest2.csv", sep=";", index_col=["key1", "id"])
Looks fine:
data1 data2
key1 id
A a1 0.0 588.0
a2 2.0 709.0
a3 3.0 877.0
B 1 3.0 468.0
2 0.0 612.0
3 2.0 139.0
4 3.0 154.0
C c1 4.0 855.0
c2 4.0 564.0
But df_loaded.combine_first(df2.set_index(["key1", "id"]))
results in:
data1 data2
key1 id
A a1 0.0 588.0
a2 2.0 709.0
a3 3.0 877.0
B 1 3.0 468.0
2 0.0 612.0
3 2.0 139.0
4 3.0 154.0
C c1 4.0 855.0
c2 4.0 564.0
B 2 2.0 317.0
3 2.0 139.0
4 3.0 154.0
Upvotes: 1
Views: 91
Reputation: 2072
You should convert the id
column to str
, not object
, before combining.
This will work:
df2.id = df2.id.astype(str)
df_loaded.combine_first(df2.set_index(["key1", "id"]))
Upvotes: 1