Reputation: 672
Background:
I ran into a problem executing code from a machine learning case. I've already solved the issue with an ugly workaround, so I am able to execute the notebook, but I still do not fully understand the cause.
The issue arises when I try to execute the following code, which is used to create dummy variables using OneHotEncoder from sklearn.
categorical_columns = ~np.in1d(train_X.dtypes, [int, float])
Although the code executes without any error, it fails to recognize numpy.int64 as an int datatype, so it classifies all int64 columns as categorical and passes them to the OneHotEncoder.
train_X is a pandas DataFrame with the following columns and datatypes; as you can see, the integers are stored as numpy.int64.
The code was originally written in Jupyter Notebook on a Mac, where it worked fine, and it also ran fine in Colaboratory on the Google cloud. Everyone else who tried running the code from Jupyter on their almost identical Windows machines had the same issue as I did.
The Problem:
It seems that on Windows machines, numpy.int64 is not linked to the native int datatype.
Things I've tried and verified
I noted the strange "on win32" here but it seems merely a product of the "infinite wisdom of Microsoft" according to post 1 and post 2
Question:
Why does numpy.int64 not translate into the native int datatype on Windows, even though everything is running 64-bit, whereas it does on Mac and other systems?
Upvotes: 1
Views: 1217
Reputation: 30579
I don't have an answer as to why the default int on 64-bit Windows is int32, but it is a very confusing fact: np.dtype('int') returns dtype('int32') on 64-bit Windows and dtype('int64') on 64-bit Linux.
See also the second warning here and this numpy github issue.
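To see what your own setup does, a small check along these lines may help. It is only a sketch: the variable names are mine, and the printed result depends on both the platform and the NumPy version (as far as I know, NumPy 2.0 and later use a 64-bit default integer on 64-bit Windows as well, so the mismatch should only show up with older releases).

```python
import numpy as np

# NumPy's default integer dtype, i.e. what the builtin int maps to here.
default_int = np.dtype(int)
print(default_int, default_int.itemsize * 8, "bits")

# An int64 column only compares equal to the builtin int when the
# default integer is itself 64 bits wide.
matches_builtin_int = np.dtype(np.int64) == default_int
print("int64 matches builtin int:", matches_builtin_int)

# Asking "is this an integer type at all?" does not depend on the platform.
is_integer = np.issubdtype(np.dtype(np.int64), np.integer)
print("int64 is an integer subtype:", is_integer)
```

If the second line prints False on your machine, that would explain why the np.in1d test against [int, float] treats the int64 columns as non-numeric.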
In your concrete case I'd use pandas' is_numeric_dtype function to check numeric-ness in a platform-independent and straightforward way:
from pandas.api.types import is_numeric_dtype
categorical_columns = ~train_X.dtypes.apply(is_numeric_dtype).to_numpy()
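As a quick illustration of how that behaves, here is a self-contained sketch. The DataFrame below is a made-up stand-in for train_X (column names and values are assumptions, not taken from the question), with the integer column stored explicitly as int64.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for train_X; the real columns are not known.
train_X = pd.DataFrame({
    "age": np.array([25, 32, 47], dtype=np.int64),
    "income": [50000.0, 64000.5, 58000.0],
    "city": ["Oslo", "Bergen", "Oslo"],
})

# True for every column that is not numeric, regardless of whether the
# platform's default integer is 32 or 64 bits wide.
categorical_columns = ~train_X.dtypes.apply(is_numeric_dtype).to_numpy()

# Names of the columns that would go to the OneHotEncoder.
categorical_names = train_X.columns[categorical_columns].tolist()
print(categorical_names)
```

Only the string column should end up in categorical_names here; the int64 and float columns are recognized as numeric.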
Upvotes: 4