Reputation: 813
I'm using Python 3 (don't know if the info is relevant).
I have 2 Pandas DataFrames (coming from read_csv()
): Compact
and SDSS_DR7_to_DR8
. Before merge, they contain types as follow :
Compact.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2070 entries, 0 to 2069
Data columns (total 8 columns):
Group 2070 non-null int64
Id 2070 non-null int64
RA 2070 non-null float64
Dec 2070 non-null float64
z 2070 non-null float64
R 2070 non-null float64
G 2070 non-null float64
objid 2070 non-null int64
dtypes: float64(5), int64(3)
memory usage: 129.5 KB
And
SDSS_DR7_to_DR8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243500 entries, 0 to 243499
Data columns (total 5 columns):
specobjid 243500 non-null int64
dr8objid 243500 non-null int64
dr7objid 243500 non-null int64
ra 243500 non-null float64
dec 243500 non-null float64
dtypes: float64(2), int64(3)
memory usage: 9.3 MB
I perform a Compact=pd.merge(Compact, SDSS_DR7_to_DR8, left_on=['objid'], right_on=['dr8objid'], how='left')
. It is executed without error, but the result is a mess. When I check the types in the new DataFrame, I get this:
Compact.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2070 entries, 0 to 2069
Data columns (total 13 columns):
Group 2070 non-null int64
Id 2070 non-null int64
RA 2070 non-null float64
Dec 2070 non-null float64
z 2070 non-null float64
R 2070 non-null float64
G 2070 non-null float64
objid 2070 non-null int64
specobjid 1275 non-null float64
dr8objid 1275 non-null float64
dr7objid 1275 non-null float64
ra 1275 non-null float64
dec 1275 non-null float64
dtypes: float64(10), int64(3)
memory usage: 226.4 KB
So during the merge, dr8objid
(and some others) has (have) been cast to float64
. How is it possible, and what can I do to prevent this (hoping this is the origin of the mess in the merge)?
EDIT
So, to be more specific: if I create df
df=pd.DataFrame(data=[[1000000000000000000,1]], columns=['key','data'])
key
and data
are both int64
. I create a transcoding df:
trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]],
columns=['key','key2'])
which 2 keys are int64
. Then
df2 = pd.merge(df, trans, on=['key'], how='left')
Gives a nice result, and key
, key2
and data
are still int64
.
Nevertheless, if I define
df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
columns=['key','data'])
and now key2
has switched to a float64
. How to prevent this? Is it because NaN
must be connected with a float ? If so, is it possible to set the merge to define the result of merging to 0 or -1 if there is no correspondance, keeping the whole column to int64
?
Upvotes: 5
Views: 5219
Reputation: 11232
Update: in Pandas 0.24, there now are Nullable integer data types.
As of this writing, the Pandas does not seem choose the nullable int data type for the result of the merge. But it's possible to convert both arrays to the nullable int type Int64
before the merge.
Consider
df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
columns=['key','data']).astype("Int64")
trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]],
columns=['key','key2']).astype("Int64")
df2 = pd.merge(df, trans, on=['key'], how='left')
Result:
>>> df2
key data key2
0 1000000000000000000 1 2000000000000000000
1 1000000000000000001 2 <NA>
>>> df2.dtypes
key Int64
data Int64
key2 Int64
dtype: object
Original answer, for Pandas < v0.24:
Is it because NaN must be connected with a float ?
Correct. There is no NaN value in an int, so missing values can only be represented in floats.
You could either filter your data before merging, making sure that there are no NaNs created.
Or you could fill in the NaNs with a value of your choosing after the merge, and then restore the dtype.
Upvotes: 4