Classification of data where attribute values are strings

Question

I have a labeled data set with 7 attributes and about 80,000 rows. However, 3 of these attributes have more than 50% missing data. I dropped every row containing a null value, which left me with about 30,000 complete rows. Each attribute value is a string, as in "this is the value of an instance of attribute i." The labels are binary (0 or 1), and every instance has one. I want to train a classifier to predict the label on a test set. I am using Python and sklearn, and am stuck on how to extract features from this dataset. Any recommendations would be much appreciated. Thanks.

jakevdp · Accepted Answer

Scikit-learn has several tools explicitly designed to extract features from text inputs; see the Text Feature Extraction section of the docs.
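To make concrete what such a tool produces, here is a minimal sketch using CountVectorizer on two made-up strings (the strings and variable names are illustrative assumptions, not taken from the question). Each string becomes one row of word counts, with one column per vocabulary term:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example strings, purely for illustration
docs = ["red apple", "green apple apple"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per string

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(terms)
# ['apple', 'green', 'red']
print(X.toarray())
# [[1 0 1]
#  [2 1 0]]
```

The resulting matrix can be passed to any scikit-learn estimator that accepts sparse input.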

Here's an example of a classifier built from a list of strings:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

data = [['this is about dogs', 'dogs are really great'],
        ['this is about cats', 'cats are evil']]
labels = ['dogs',
          'cats']

vec = CountVectorizer()  # count word occurrences
X = vec.fit_transform([' '.join(row) for row in data])

clf = MultinomialNB()  # very simple model for word counts
clf.fit(X, labels)

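# one new instance with two string attributes, joined into a single document
new_data = ['this is about cats too', 'I think cats are awesome']
new_X = vec.transform([' '.join(new_data)])  # reuse the fitted vocabulary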

print(clf.predict(new_X))
# ['cats']
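The example above joins all attributes of a row into one string. If you would rather keep a separate vocabulary per attribute and not discard rows with missing values, one possible variation is sketched below. It is only a sketch under stated assumptions: the rows, the 0/1 labels, and the to_array helper are made up for illustration, and replacing a missing value with an empty string is one option among several that may or may not suit your data.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up rows with two string attributes; None marks a missing value
rows = [
    ['this is about dogs', 'dogs are really great'],
    ['this is about cats', None],
    ['dogs again', 'great dogs'],
    ['cats again', 'cats are evil'],
]
labels = [1, 0, 1, 0]  # hypothetical binary labels: 1 = dogs, 0 = cats


def to_array(rows):
    """Hypothetical helper: replace missing values with '' and
    return a 2-D object array with one column per attribute."""
    return np.array(
        [[v if v is not None else '' for v in row] for row in rows],
        dtype=object,
    )


# One CountVectorizer per column; an integer column spec hands each
# vectorizer a 1-D column of strings
features = ColumnTransformer([
    ('attr0', CountVectorizer(), 0),
    ('attr1', CountVectorizer(), 1),
])

model = make_pipeline(features, MultinomialNB())
model.fit(to_array(rows), labels)

# A new instance whose second attribute is missing
pred = model.predict(to_array([['this is about cats too', None]]))
print(pred)
```

With this layout a word such as "cats" is counted separately in each attribute, and a row with a missing value simply contributes zero counts for that attribute instead of being dropped.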
