Reputation: 9
I am unable to tell which feature(s) are categorical/discrete. I want to identify them so that I can compute the frequency of each value of the categorical feature(s). Similarly, I eventually want to identify and use all numerical features.
In this dataset from sklearn, which feature(s) are categorical/discrete? And is there a way to find this automatically, for example with pandas? I know dtype can be used, but it does not tell whether a column is categorical, since a numeric column can encode categories.
from sklearn.datasets import load_wine
import pandas as pd
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
df.head(25)
Upvotes: 0
Views: 349
Reputation: 262484
The question is subjective. Integers are discrete by definition, so one option is to cast each column to integer (which truncates any fractional part) and check which columns are unchanged by the cast, i.e. contain only integer values:
df.eq(df.astype(int)).all()
output:
alcohol False
malic_acid False
ash False
alcalinity_of_ash False
magnesium True
total_phenols False
flavanoids False
nonflavanoid_phenols False
proanthocyanins False
color_intensity False
hue False
od280/od315_of_diluted_wines False
proline True
target True
dtype: bool
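As a side note, `astype(int)` raises an error if a column contains missing values. A minimal sketch of an equivalent check that tolerates NaN is shown below; the helper name `integer_valued_columns` and the toy frame are hypothetical, and treating NaN as "passing" is an assumption you may not want.

```python
import numpy as np
import pandas as pd

def integer_valued_columns(frame: pd.DataFrame) -> pd.Series:
    """Return a boolean Series: True where every non-missing value is a whole number."""
    # Only numeric columns can be tested with the modulo operator.
    numeric = frame.select_dtypes(include="number")
    # x % 1 == 0 holds for whole numbers; NaN % 1 is NaN, so missing
    # values are explicitly accepted here (an assumption of this sketch).
    whole = numeric.mod(1).eq(0) | numeric.isna()
    return whole.all()

# Hypothetical toy data, not the wine dataset.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],  # whole numbers plus a missing value
    "b": [0.5, 1.0, 2.0],     # contains a fractional value
    "c": [3, 4, 5],           # integer dtype
})
result = integer_valued_columns(toy)
```

On a frame without missing values this should agree with the `astype(int)` comparison above.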
However, an integer column can still take many distinct values (e.g. magnesium and proline here). Do you want to set a threshold, e.g. fewer than 10 different categories?
df.eq(df.astype(int)).all() & df.astype(int).nunique().lt(10)
output:
alcohol False
malic_acid False
ash False
alcalinity_of_ash False
magnesium False
total_phenols False
flavanoids False
nonflavanoid_phenols False
proanthocyanins False
color_intensity False
hue False
od280/od315_of_diluted_wines False
proline False
target True
dtype: bool
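To get from this mask to the frequencies asked about in the question, one possible sketch is to split the column names into two lists and call `value_counts` on the discrete ones. The function name `split_features`, the parameter `max_categories`, and the toy frame are hypothetical; the default of 10 simply mirrors the threshold above and is still a subjective choice.

```python
import pandas as pd

def split_features(frame: pd.DataFrame, max_categories: int = 10):
    """Split numeric columns into (discrete, continuous) name lists.

    A column counts as discrete when all its non-missing values are whole
    numbers and it has fewer than `max_categories` distinct values.
    """
    numeric = frame.select_dtypes(include="number")
    is_whole = (numeric.mod(1).eq(0) | numeric.isna()).all()
    is_low_cardinality = numeric.nunique().lt(max_categories)
    discrete_mask = is_whole & is_low_cardinality
    discrete = discrete_mask[discrete_mask].index.tolist()
    continuous = discrete_mask[~discrete_mask].index.tolist()
    return discrete, continuous

# Hypothetical toy data shaped like two of the wine columns.
toy = pd.DataFrame({
    "alcohol": [14.23, 13.20, 13.16, 12.37],
    "target": [0, 0, 1, 2],
})
discrete, continuous = split_features(toy)

# Frequency of each value for every column judged discrete.
frequencies = {col: toy[col].value_counts().sort_index() for col in discrete}
```

With the `df` from the question, `split_features(df)` should put only `target` in the discrete list, matching the output above, and the remaining columns can be used as numerical features.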
Upvotes: 2