Reputation: 105
I am working on a customer segmentation project using unsupervised machine learning, with more than 300 original features. I am in the data cleaning phase.
Some features come at two levels of granularity: a coarse version and a fine version. For example:
Family: coarse category: 1, 2, 3 as family; fine data: 1 as young family, 2 as single-parent family.
Income: coarse: 1, 2, 3 as 1-100000; fine: 1: 1-3000, 2: 3001-6000, 3: 6000-10000.
Are there any criteria for deciding whether to keep both levels or only one of them?
FYI: after data cleaning, I will use PCA and KMeans for the segmentation.
Upvotes: 0
Views: 112
Reputation: 1174
Since the finer-grained column contains all the information in the coarser-grained column, you can drop the coarser-grained column, which also avoids correlated features.
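Before dropping the coarse column, it may be worth checking that the fine column really does determine it. The following is a minimal sketch in plain Python; the column names and the codes (six fine income bands nested in three coarse bands) are hypothetical and not taken from the question's data.

```python
from collections import defaultdict


def fine_determines_coarse(fine, coarse):
    """Return True if every fine code maps to exactly one coarse code."""
    seen = defaultdict(set)
    for f, c in zip(fine, coarse):
        seen[f].add(c)
    return all(len(codes) == 1 for codes in seen.values())


# Hypothetical data: fine income bands 1-6 nested in coarse bands 1-3.
income_fine = [1, 2, 3, 4, 5, 6, 2, 5]
income_coarse = [1, 1, 2, 2, 3, 3, 1, 3]

columns = {"income_fine": income_fine, "income_coarse": income_coarse}

# Drop the coarse column only when it carries no extra information.
if fine_determines_coarse(income_fine, income_coarse):
    columns.pop("income_coarse")

print(sorted(columns))
```

If the check fails, the two columns are not a strict hierarchy (for example because of inconsistent coding), and dropping one of them would lose information.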
However, it ultimately depends on your model: whether it is affected by correlated features, and whether it can implicitly aggregate to the coarser level (e.g. decision trees can).
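To illustrate what "aggregating implicitly" can mean, here is a rough sketch of how threshold splits on an ordinal fine code can reproduce the coarse grouping, which is the kind of split a decision tree is able to learn. The mapping and the helper names `coarse_split_points` and `coarse_from_fine` are hypothetical, and the sketch assumes the fine codes are ordinal and nest in order inside the coarse groups.

```python
def coarse_split_points(fine_to_coarse):
    """Thresholds on the fine code at which the coarse code changes."""
    codes = sorted(fine_to_coarse)
    return [
        (lo + hi) / 2
        for lo, hi in zip(codes, codes[1:])
        if fine_to_coarse[lo] != fine_to_coarse[hi]
    ]


def coarse_from_fine(fine_code, thresholds, coarse_labels):
    """Count how many thresholds the fine code exceeds to pick a group."""
    index = sum(fine_code > t for t in thresholds)
    return coarse_labels[index]


# Hypothetical nesting: fine bands 1-2 -> coarse 1, 3-4 -> 2, 5-6 -> 3.
mapping = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}

thresholds = coarse_split_points(mapping)
# Coarse labels in the order they appear along the fine scale.
labels = list(dict.fromkeys(mapping[code] for code in sorted(mapping)))

recovered = {code: coarse_from_fine(code, thresholds, labels) for code in mapping}
print(thresholds, recovered == mapping)
```

This only applies to models that split on thresholds. Distance-based methods such as PCA followed by KMeans do not perform this kind of aggregation on their own, so for them the choice of level matters more.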
Upvotes: 1