Matt
Matt

Reputation: 813

Merge changes Pandas types

I'm using Python 3 (don't know if the info is relevant). I have 2 Pandas DataFrames (coming from read_csv()): Compact and SDSS_DR7_to_DR8. Before merge, they contain types as follow :

Compact.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2070 entries, 0 to 2069
Data columns (total 8 columns):
Group    2070 non-null int64
Id       2070 non-null int64
RA       2070 non-null float64
Dec      2070 non-null float64
z        2070 non-null float64
R        2070 non-null float64
G        2070 non-null float64
objid    2070 non-null int64
dtypes: float64(5), int64(3)
memory usage: 129.5 KB

And

SDSS_DR7_to_DR8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243500 entries, 0 to 243499
Data columns (total 5 columns):
specobjid    243500 non-null int64
dr8objid     243500 non-null int64
dr7objid     243500 non-null int64
ra           243500 non-null float64
dec          243500 non-null float64
dtypes: float64(2), int64(3)
memory usage: 9.3 MB

I perform a Compact=pd.merge(Compact, SDSS_DR7_to_DR8, left_on=['objid'], right_on=['dr8objid'], how='left'). It is executed without error, but the result is a mess. When I check the types in the new DataFrame, I get this:

Compact.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2070 entries, 0 to 2069
Data columns (total 13 columns):
Group        2070 non-null int64
Id           2070 non-null int64
RA           2070 non-null float64
Dec          2070 non-null float64
z            2070 non-null float64
R            2070 non-null float64
G            2070 non-null float64
objid        2070 non-null int64
specobjid    1275 non-null float64
dr8objid     1275 non-null float64
dr7objid     1275 non-null float64
ra           1275 non-null float64
dec          1275 non-null float64
dtypes: float64(10), int64(3)
memory usage: 226.4 KB

So during the merge, dr8objid (and some others) has (have) been cast to float64. How is it possible, and what can I do to prevent this (hoping this is the origin of the mess in the merge)?

EDIT So, to be more specific: if I create df

df=pd.DataFrame(data=[[1000000000000000000,1]], columns=['key','data'])

key and data are both int64. I create a transcoding df:

trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]], 
                   columns=['key','key2'])

which 2 keys are int64. Then

df2 = pd.merge(df, trans, on=['key'], how='left')

Gives a nice result, and key, key2 and data are still int64.

Nevertheless, if I define

df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
                columns=['key','data'])

Now after the merge, I get

and now key2 has switched to a float64. How to prevent this? Is it because NaN must be connected with a float ? If so, is it possible to set the merge to define the result of merging to 0 or -1 if there is no correspondance, keeping the whole column to int64?

Upvotes: 5

Views: 5219

Answers (1)

w-m
w-m

Reputation: 11232

Update: in Pandas 0.24, there now are Nullable integer data types.

As of this writing, the Pandas does not seem choose the nullable int data type for the result of the merge. But it's possible to convert both arrays to the nullable int type Int64 before the merge.

Consider

df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
                columns=['key','data']).astype("Int64")
trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]], 
                   columns=['key','key2']).astype("Int64")
df2 = pd.merge(df, trans, on=['key'], how='left')

Result:

>>> df2
                   key  data                 key2
0  1000000000000000000     1  2000000000000000000
1  1000000000000000001     2                 <NA>

>>> df2.dtypes
key     Int64
data    Int64
key2    Int64
dtype: object

Original answer, for Pandas < v0.24:

Is it because NaN must be connected with a float ?

Correct. There is no NaN value in an int, so missing values can only be represented in floats.

You could either filter your data before merging, making sure that there are no NaNs created.

Or you could fill in the NaNs with a value of your choosing after the merge, and then restore the dtype.

Upvotes: 4

Related Questions