iuvbio
iuvbio

Reputation: 649

How to force MultiIndex level to dtype

I use combine_first to combine two DataFrames based on two keys, the goal is to append the indices of df2 that are not contained in df1 to the result and to overwrite indices that are contained in both with the values from df2.

Example df1:

df1 = pd.DataFrame({
    "key1": ["A", "A", "A", "B", "B", "C", "C"],
    "id": ["a1", "a2", "a3", 1, 2, "c1", "c2"],
    "data1": [pd.np.random.randint(5) for i in range(7)],
    "data2": [pd.np.random.randint(1000) for i in range(7)]
})

Example df2:

df2 = pd.DataFrame({
    "key1": ["B", "B", "B"],
    "id": [2, 3, 4],
    "data1": [pd.np.random.randint(5) for i in range(3)],
    "data2": [pd.np.random.randint(1000) for i in range(3)]
})

df1.set_index(["key1", "id"]).combine_first(df2.set_index(["key1", "id"])) gives the desired result:

         data1  data2
key1 id              
A    a1    0.0  588.0
     a2    2.0  709.0
     a3    3.0  877.0
B    1     3.0  468.0
     2     0.0  612.0
     3     2.0  139.0
     4     3.0  154.0
C    c1    4.0  855.0
     c2    4.0  564.0

However, after storing the result as csv, loading it again and running the same command I get the following error:

TypeError: '<' not supported between instances of 'str' and 'int'

But only for a df2 where id is int only. When id contains characters it works without a problem.

I found this in the Pandas docs:

The different indexing operation can potentially change the dtype of a Series.

Which explains why the problem persists if I change the dtype of df2.id to object before setting the index and combining. How can I specifically set the dtype of a MultiIndex level so that the combination works?


EDIT To further illustrate the problem:

df = df1.set_index(["key1", "id"]).combine_first(df2.set_index(["key1", "id"]))
df.to_csv("tests/combtest2.csv", sep=";")
df_loaded = pd.read_csv("tests/combtest2.csv", sep=";", index_col=["key1", "id"])

Looks fine:

         data1  data2
key1 id              
A    a1    0.0  588.0
     a2    2.0  709.0
     a3    3.0  877.0
B    1     3.0  468.0
     2     0.0  612.0
     3     2.0  139.0
     4     3.0  154.0
C    c1    4.0  855.0
     c2    4.0  564.0

But df_loaded.combine_first(df2.set_index(["key1", "id"])) results in:

         data1  data2
key1 id              
A    a1    0.0  588.0
     a2    2.0  709.0
     a3    3.0  877.0
B    1     3.0  468.0
     2     0.0  612.0
     3     2.0  139.0
     4     3.0  154.0
C    c1    4.0  855.0
     c2    4.0  564.0
B    2     2.0  317.0
     3     2.0  139.0
     4     3.0  154.0

Upvotes: 1

Views: 91

Answers (1)

Qusai Alothman
Qusai Alothman

Reputation: 2072

You should convert the id column to str, not object, before combining.

This will work:

df2.id = df2.id.astype(str)
df_loaded.combine_first(df2.set_index(["key1", "id"]))

Upvotes: 1

Related Questions