Alex Woolfe

Reputation: 9

Selecting categorical/discrete features and numerical features of sklearn dataset using Pandas

I am unable to identify which features are categorical/discrete. I want to do this so that I can compute the frequency of each value of the categorical feature(s). Similarly, I eventually want to identify and use all numerical features.

In this dataset from sklearn, which features are categorical/discrete? And is there a way to find them automatically, for example with pandas? I know dtype can be used, but it does not tell whether a feature is categorical, since a numeric column can still encode categories.

from sklearn.datasets import load_wine
import pandas as pd

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
df.head(25)

Upvotes: 0

Views: 349

Answers (1)

mozway

Reputation: 262484

The question is subjective. Integers are obviously discrete, so you can find the integer-only columns by casting to integer (which truncates the decimal part rather than rounding) and checking which columns are left unchanged:

df.eq(df.astype(int)).all()

output:

alcohol                         False
malic_acid                      False
ash                             False
alcalinity_of_ash               False
magnesium                        True
total_phenols                   False
flavanoids                      False
nonflavanoid_phenols            False
proanthocyanins                 False
color_intensity                 False
hue                             False
od280/od315_of_diluted_wines    False
proline                          True
target                           True
dtype: bool

However, some of these integer columns still take many distinct values. Do you want to set a threshold, e.g. fewer than 10 distinct values?

df.eq(df.astype(int)).all() & df.astype(int).nunique().lt(10)

output:

alcohol                         False
malic_acid                      False
ash                             False
alcalinity_of_ash               False
magnesium                       False
total_phenols                   False
flavanoids                      False
nonflavanoid_phenols            False
proanthocyanins                 False
color_intensity                 False
hue                             False
od280/od315_of_diluted_wines    False
proline                         False
target                           True
dtype: bool
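To get from that boolean mask to the value frequencies asked about in the question, something along these lines might work. This is only a minimal sketch: the small DataFrame is a made-up stand-in for the wine data (so it runs without sklearn), and `max_categories`, `categorical_cols`, `numerical_cols`, and `frequencies` are illustrative names, not part of pandas.

```python
import pandas as pd

# Toy stand-in for the wine DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({
    "alcohol": [14.23, 13.20, 13.16, 12.37],
    "magnesium": [127.0, 100.0, 101.0, 88.0],
    "target": [0.0, 0.0, 1.0, 2.0],
})

max_categories = 4  # assumed threshold; pick whatever suits your data

# Same test as above: integer-valued columns with few distinct values
as_int = df.astype(int)
is_categorical = df.eq(as_int).all() & as_int.nunique().lt(max_categories)

# Split the column names using the boolean mask
categorical_cols = is_categorical[is_categorical].index.tolist()
numerical_cols = is_categorical[~is_categorical].index.tolist()

# Frequency of each value, per categorical column
frequencies = {col: df[col].value_counts() for col in categorical_cols}

print(categorical_cols)
print(numerical_cols)
print(frequencies)
```

Note that `astype(int)` raises on missing values, so columns containing NaN would need to be handled (e.g. dropped or filled) before applying this check.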

Upvotes: 2
