Reputation: 11
Here is a small data frame containing a very small slice of the data that I need to encode. (Image: DataFrame to Encode)
My current attempt uses scikit-learn's LabelEncoder():
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["local", "animals", "local", "diet", "food", "health", "local", "police brutality", "police", "kids", "dogs"])
list(le.classes_)
(output)
['animals',
'diet',
'dogs',
'food',
'health',
'kids',
'local',
'police',
'police brutality']
I have now added all my desired targets to the encoder, so now I need to start encoding. The problem is that LabelEncoder takes arguments like this:
le.transform(["local"]) #For the first row in the data frame
(output) array([6])
Now that's the correct encoding for the first row, but how would I do this for every other row? I don't think writing it out by hand is feasible, as my actual data set has about 6000 samples.
I'm also not sure whether the targets should be comma-separated or not. I can always change that, but my end goal is a new data frame with encoded labels instead of the categorical labels.
Also, since the encoder returns a single array, if I did the same thing for every row, each with a different number of labels (e.g. (dogs, animals) instead of (local)), I would need to combine all the arrays into a matrix, and I have no idea how to do that either. Thanks so much for the help!
Upvotes: 1
Views: 4851
Reputation: 311
Try this:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit([["local", "animals", "local", "diet", "food", "health", "local", "police brutality", "police", "kids", "dogs"]])
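To go from the fitted binarizer to an encoded data frame, something along these lines might work. This is only a sketch: the data frame `df` and the column name `targets` are made-up stand-ins (the real column names are not shown in the question), and it assumes each cell holds a comma-separated string of labels.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the real data; the column name "targets" is assumed.
df = pd.DataFrame(
    {"targets": ["local", "dogs, animals", "police brutality, police", "kids"]}
)

# Split each comma-separated string into a list of stripped label names.
label_lists = df["targets"].apply(
    lambda s: [part.strip() for part in s.split(",")]
)

mlb = MultiLabelBinarizer()
# One row per sample, one 0/1 column per distinct label.
encoded = mlb.fit_transform(label_lists)

# Wrap the matrix in a data frame aligned with the original row index.
encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
print(encoded_df)
```

Every row ends up with the same number of columns regardless of how many labels it has, so there is no need to append arrays by hand.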
Upvotes: 0
Reputation: 12582
I think you'd want MultiLabelBinarizer. For sklearn multilabel models anyway, the expected format is the boolean array rather than an array of lists of integers.
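To illustrate the difference between the two formats, here is a small sketch on three made-up rows (the `rows` list is illustrative, not taken from the real data). Per-row integer codes have different lengths, while the indicator matrix is rectangular and can be mapped back to the label names.

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Illustrative rows, each with a different number of labels.
rows = [["local"], ["dogs", "animals"], ["police", "kids", "local"]]

# Integer codes per row: ragged lists, so they do not form a matrix.
le = LabelEncoder().fit([label for row in rows for label in row])
int_rows = [le.transform(row).tolist() for row in rows]

# Indicator matrix: one row per sample, one column per distinct label.
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(rows)

# The label sets can be recovered from the indicator matrix.
recovered = mlb.inverse_transform(indicator)

print(int_rows)
print(indicator)
print(recovered)
```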
Upvotes: 1
Reputation: 106
I've had this issue recently too. Here's a class that does the job:
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # list of column names to encode

    def fit(self, X, y=None):
        return self  # nothing to learn here; encoders are fitted in transform

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # DataFrame.iteritems() was removed in pandas 2.0; use items()
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
And here's how you'd use it on your example. Note that columns takes the names of the data frame columns to encode, not the label values; "targets" below is a placeholder for your actual column name:
dataset = MultiColumnLabelEncoder(columns=["targets"]).fit_transform(dataset)
Upvotes: 0