raja

Reputation: 450

Machine Learning: combining features into single feature

I am a beginner in machine learning. I am confused how to combine different features of a data set into one single feature.

For example, I have a data set in Python Pandas data frame of features like this:

movie        unknown  action  adventure  animation  fantasy  horror  romance  sci-fi
Toy Story    0        1       1          0          1        0       0        1
Golden Eye   0        1       0          0          0        0       1        0
Four Rooms   1        0       0          0          0        0       0        0
Get Shorty   0        0       0          1          1        0       1        0
Copy Cat     0        0       1          0          0        1       0        0

I would like to convert these n features into one single feature named "movie_genre". One solution would be to assign an integer value to each genre (unknown = 0, action = 1, adventure = 2, etc.) and create a data frame like this:

movie       genre
Toy Story   1,2,4,7
Golden Eye  1,6
Four Rooms  0
Get Shorty  3,4,6
Copy Cat    2,5

But in this case the entries in the column will no longer be integer/float values. Will that affect later steps in the machine learning process, such as fitting a model and evaluating the algorithms?
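For reference, the conversion described above could be sketched roughly as follows. This is only an illustration: the data frame is rebuilt by hand from the table in the question, and the names `genres`, `rows` and `movie_genre` are chosen for the example. Note that the resulting column holds Python lists, so pandas stores it with dtype `object` rather than as a numeric column.

```python
import pandas as pd

# Genre columns in the order given in the question (unknown = 0, action = 1, ...).
genres = ["unknown", "action", "adventure", "animation",
          "fantasy", "horror", "romance", "sci-fi"]

# The example rows, typed in by hand from the table above.
rows = {
    "Toy Story":  [0, 1, 1, 0, 1, 0, 0, 1],
    "Golden Eye": [0, 1, 0, 0, 0, 0, 1, 0],
    "Four Rooms": [1, 0, 0, 0, 0, 0, 0, 0],
    "Get Shorty": [0, 0, 0, 1, 1, 0, 1, 0],
    "Copy Cat":   [0, 0, 1, 0, 0, 1, 0, 0],
}
df = pd.DataFrame.from_dict(rows, orient="index", columns=genres)

# For every movie, collect the positions of the columns that are set to 1.
movie_genre = pd.Series(
    [[i for i, flag in enumerate(row) if flag == 1] for row in df.to_numpy()],
    index=df.index,
    name="movie_genre",
)

print(movie_genre)
print(movie_genre.dtype)  # object, not an integer or float dtype
```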

Upvotes: 2

Views: 10370

Answers (3)

Michael Kirchner

Reputation: 889

It may be effective to leave them in their current multi-feature format and apply a dimensionality reduction technique to that data.

This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
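As a small illustration of one-hot encoding on a single categorical feature, here is a sketch using `pandas.get_dummies`. The three-row `films` frame is made up for the example (each film has exactly one genre here, unlike the data in the question).

```python
import pandas as pd

# Hypothetical data: one categorical "genre" value per film.
films = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "genre": ["action", "horror", "action"],
})

# One column per level of the categorical feature, filled with 0/1.
encoded = pd.get_dummies(films["genre"], dtype=int)

print(encoded)
```

The result has one `action` column and one `horror` column, which is the same shape of data the question starts from, except that there each row may have several 1s.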

Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
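To make the PCA idea concrete, here is a minimal sketch that projects the five example rows from the question onto two components. It does the decomposition by hand with NumPy's SVD so the steps are visible; scikit-learn's `sklearn.decomposition.PCA` packages the same idea. The choice of two components is arbitrary, and five rows is far too few for this to be meaningful, so treat it purely as a demonstration of the mechanics.

```python
import numpy as np

# The 5 x 8 binary genre matrix from the question (rows = movies).
X = np.array([
    [0, 1, 1, 0, 1, 0, 0, 1],  # Toy Story
    [0, 1, 0, 0, 0, 0, 1, 0],  # Golden Eye
    [1, 0, 0, 0, 0, 0, 0, 0],  # Four Rooms
    [0, 0, 0, 1, 1, 0, 1, 0],  # Get Shorty
    [0, 0, 1, 0, 0, 1, 0, 0],  # Copy Cat
], dtype=float)

# PCA works on mean-centered data.
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, ordered by singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 2  # arbitrary choice for the example
X_reduced = X_centered @ Vt[:n_components].T

# Fraction of the total variance carried by each component.
explained = S**2 / (S**2).sum()

print(X_reduced.shape)
print(explained[:n_components])
```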

Since you're using Python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.

Upvotes: 1

user5725006

Reputation:

One thing you can do is to make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations it will have the same length as the original. If there are combinations that you don't need simply don't take them into account. Your network is label-agnostic and it won't mind.

But why is that a problem? Your dataset looks small enough.

Upvotes: 1

Mohammad Athar

Reputation: 1980

Convert each series of zeros and ones into an 8-bit number.

Toy Story = 01101001

Read as a binary number, that's 105 in decimal.

Similarly, Golden Eye = 01000010 = 66.

You can do the rest manually here: http://www.binaryhexconverter.com/binary-to-decimal-converter

It's relatively straightforward to do programmatically: just loop through each label, assign it the appropriate power of two, then sum them up.
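A minimal sketch of that power-of-two approach, assuming the data frame from the question is rebuilt by hand and that the leftmost column (`unknown`) is treated as the most significant bit. The names `genres`, `rows` and `weights` are just for the example.

```python
import pandas as pd

genres = ["unknown", "action", "adventure", "animation",
          "fantasy", "horror", "romance", "sci-fi"]

# The example rows from the question, typed in by hand.
rows = {
    "Toy Story":  [0, 1, 1, 0, 1, 0, 0, 1],
    "Golden Eye": [0, 1, 0, 0, 0, 0, 1, 0],
    "Four Rooms": [1, 0, 0, 0, 0, 0, 0, 0],
    "Get Shorty": [0, 0, 0, 1, 1, 0, 1, 0],
    "Copy Cat":   [0, 0, 1, 0, 0, 1, 0, 0],
}
df = pd.DataFrame.from_dict(rows, orient="index", columns=genres)

# Leftmost column is the most significant bit: 128, 64, 32, ..., 1.
weights = [2 ** (len(genres) - 1 - i) for i in range(len(genres))]

# Multiply each 0/1 flag by its power of two and sum across the row.
df["movie_genre"] = (df[genres] * weights).sum(axis=1)

print(df["movie_genre"])

# The bit pattern can be recovered from the integer, e.g. for Toy Story:
print(format(int(df.loc["Toy Story", "movie_genre"]), "08b"))  # 01101001
```

Keep in mind that the resulting integers are just identifiers for genre combinations; their numeric order does not carry any meaning.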

Upvotes: 2
