Álvaro V.

Reputation: 39

VarianceThreshold() not returning expected output

I'm at the stage of cleaning the categorical variables in my data. More specifically, I'm now removing quasi-constant categorical variables.

I've searched and found that VarianceThreshold() from sklearn.feature_selection can do the job. However, I get unexpected results. Here is my code:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import VarianceThreshold

# Create a temporary dataframe that fills null values with the string "null_val",
# as OrdinalEncoder() doesn't work with null values
temp_df = train_df_cat.fillna("null_val")

# Initiate the ordinal encoder, encode the labels as numbers,
# then convert the result back to a dataframe
ord_enc = OrdinalEncoder()
temp_df = ord_enc.fit_transform(temp_df)
temp_df = pd.DataFrame(temp_df, columns=train_df_cat.columns)

# Get the columns where 90% or more of the values are constant
var_thr = VarianceThreshold(threshold = 0.1)
var_thr.fit(temp_df)
quasi_constant_cat = [column for column in temp_df.columns
                      if column not in temp_df.columns[var_thr.get_support()]]

# Display results
display(quasi_constant_cat)

Returns this:

['Street',
 'Utilities',
 'LandSlope',
 'Condition2',
 'Heating',
 'CentralAir',
 'PoolQC']

Supposedly, those are the features in which a single value is present 90% or more of the time. However:

display(temp_df["Alley"].value_counts(normalize=True))

Returns the following, which matches what I had seen on a plot above:

2.00   0.94
0.00   0.03
1.00   0.03
Name: Alley, dtype: float64

Therefore, the Alley feature (and maybe others) has the same value 2.00 (which is actually the number imputed for null values in this temp_df) in 94% of its rows, yet it is not included in the output of VarianceThreshold().

What should I change in my code to make this function work properly?

Upvotes: 0

Views: 493

Answers (1)

dx2-66

Reputation: 2851

The variance in this particular case is E[X²] − (E[X])² = (2² · 0.94 + 1² · 0.03 + 0² · 0.03) − (2 · 0.94 + 1 · 0.03 + 0 · 0.03)² = 3.79 − 1.91² = 0.1419 > 0.1, so the column passes the threshold and is kept.

Looks like you'll need a bit higher threshold.

Upvotes: 1
