Annie

Reputation: 105

Feature selection: coarse or fine data

I am working on a customer segmentation project using unsupervised machine learning, with more than 300 original features. I am currently in the data cleaning phase.

Some features come at two levels of granularity: a coarse version and a fine version. For example:

  1. Family: coarse category coded 1, 2, 3 (e.g. family); fine category coded 1 = young family, 2 = single-parent family.

  2. Income: coarse category coded 1, 2, 3 (e.g. 1-100000); fine category coded 1 = 1-3000, 2 = 3001-6000, 3 = 6000-10000.

Are there any criteria for deciding whether to keep both levels or only one of them?

FYI: after the data cleaning, I will use PCA and KMeans to do the segmentation.

Upvotes: 0

Views: 112

Answers (1)

Paul

Reputation: 1174

Since the finer-grained column contains all the information in the coarser-grained column, you can simply drop the coarser-grained column and so avoid correlated features.
However, it ultimately depends on your model: whether it is affected by correlated features, and whether it can aggregate to the coarser level implicitly (decision trees, for example, can).
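
Before dropping anything, it may be worth verifying that the fine column really does determine the coarse one, i.e. that each fine code maps to exactly one coarse code. A minimal sketch in plain Python, where the function name, column names, and toy data are all made up for illustration:

```python
def fine_determines_coarse(fine, coarse):
    """Return True if every fine code maps to exactly one coarse code."""
    mapping = {}
    for f, c in zip(fine, coarse):
        # setdefault stores the first coarse code seen for this fine code;
        # a later, different coarse code means the levels are not nested.
        if mapping.setdefault(f, c) != c:
            return False
    return True


# Hypothetical toy data: fine family codes nested inside coarse codes.
columns = {
    "family_fine": [1, 2, 1, 3, 4, 3],
    "family_coarse": [1, 1, 1, 2, 2, 2],
}

if fine_determines_coarse(columns["family_fine"], columns["family_coarse"]):
    # The coarse column adds no information beyond the fine one, so drop it.
    del columns["family_coarse"]
```

If the check fails for a pair of columns, the coarse column carries information the fine one does not, and dropping it would lose something.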

Upvotes: 1
