mwoods

Reputation: 317

Strategies for handling nominal values with numerical attributes

I'm working with a data set from Salesforce (SFDC) that consists mostly of nominal values (e.g. EE Names, Title, Role, Lead Source, Account Name), and I'm trying to relate these features to a boolean class: whether a Sales Lead was converted to a Sales Contact.

I wanted to run this data through some basic feature selection algorithms, but most require numerical values only. I could map each unique value of a field to a new boolean feature, but then I'll generate an extremely large number of new features, and I'm not sure that will give a meaningful output. Admittedly, the best solution might be to run the data through a decision tree, but I wanted to see whether others in the community have strategies for handling mostly nominal data sets that have been used successfully in real-world applications.

I'm using Python with SciPy, NumPy, pandas and scikit-learn to do my analysis.
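
To make the boolean mapping concrete, here is a minimal sketch using `pandas.get_dummies`. The column names and values are made up for illustration and are not taken from the real data; the point is only that the number of generated columns equals the total number of distinct values across the encoded fields.

```python
import pandas as pd

# Hypothetical lead data; the real SFDC fields and values will differ.
leads = pd.DataFrame({
    "title": ["CTO", "Engineer", "CTO", "Analyst"],
    "lead_source": ["Web", "Referral", "Referral", "Web"],
})

# One indicator column per distinct value of each nominal field.
encoded = pd.get_dummies(leads, columns=["title", "lead_source"])

# The width of the result is the sum of the per-field cardinalities.
n_expected = int(leads.nunique().sum())
print(encoded.shape, n_expected)
```

With high-cardinality fields such as account names, this width grows quickly, which is the concern described above.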

Upvotes: 0

Views: 913

Answers (1)

ogrisel

Reputation: 40159

I would first try `sklearn.feature_extraction.DictVectorizer` to turn the nominal fields into a sparse one-hot representation, and then apply chi-squared (`chi2`) univariate feature selection, which works with sparse data. For instance, there is an application of `chi2` feature selection to sparse text data in this scikit-learn example: http://scikit-learn.org/dev/auto_examples/document_classification_20newsgroups.html
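
A minimal sketch of that pipeline follows. The records, field names, labels and the choice of `k=3` are assumptions made up for illustration; only `DictVectorizer`, `SelectKBest` and `chi2` come from scikit-learn itself.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical lead records (one dict per lead); real fields will differ.
records = [
    {"title": "CTO", "lead_source": "Web", "role": "Decision Maker"},
    {"title": "Engineer", "lead_source": "Referral", "role": "Influencer"},
    {"title": "CTO", "lead_source": "Referral", "role": "Decision Maker"},
    {"title": "Analyst", "lead_source": "Web", "role": "Influencer"},
    {"title": "Engineer", "lead_source": "Trade Show", "role": "Influencer"},
    {"title": "CTO", "lead_source": "Web", "role": "Decision Maker"},
]
# Hypothetical boolean class: 1 if the lead was converted, 0 otherwise.
converted = [1, 0, 1, 0, 0, 1]

# String values become one sparse indicator column per "field=value" pair.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(records)

# Keep the k columns with the highest chi-squared score against the class.
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, converted)

# Map the kept column indices back to readable feature names.
selected = [vectorizer.feature_names_[i]
            for i in selector.get_support(indices=True)]
print(selected)
```

The matrix stays sparse throughout, so a large number of indicator columns is less of a problem than it would be with a dense array.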

Unfortunately, scikit-learn's decision trees and ensemble methods do not work on sparse representations yet.

Upvotes: 1
