Reputation: 1255

Convert categorical variables from String to int representation

I have a NumPy array of text classification labels stored as strings, e.g. y_train = ['A', 'B', 'A', 'C', ...]. I am trying to apply scikit-learn's multinomial naive Bayes (MultinomialNB) algorithm to predict classes for the entire dataset.

I want to convert the string classes into integers so that I can pass them to the algorithm, i.e. turn ['A', 'B', 'A', 'C', ...] into [1, 2, 1, 3, ...].

I can write a for loop that goes through the array and builds a new one with integer class labels, but is there a direct function to achieve this?

Upvotes: 21

Answers (3)

ListenSoftware Louise Ai Agent

Reputation: 4263

Another way is to use astype('category').cat.codes on a DataFrame column to convert the string values into numbers:

X = df[['User ID', 'Gender', 'Age', 'EstimatedSalary']].copy()  # copy avoids SettingWithCopyWarning
X['Gender'] = X['Gender'].astype('category').cat.codes  # string labels -> integer codes
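For illustration, here is a minimal self-contained sketch of the same idea. The DataFrame below is made up (only the 'Gender' and 'Age' column names come from the snippet above), and it also keeps the category index around so the codes can be mapped back to labels:

```python
import pandas as pd

# Hypothetical data; the real df from the answer is not shown in the thread.
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male'],
                   'Age': [19, 35, 26, 27]})

gender = df['Gender'].astype('category')
codes = gender.cat.codes             # one integer code per row
categories = gender.cat.categories   # position in this index == code

print(codes.tolist())                # [1, 0, 0, 1]
print(list(categories))              # ['Female', 'Male']

# Map the integer codes back to the original string labels.
restored = categories.take(codes.to_numpy())
print(list(restored))                # ['Male', 'Female', 'Female', 'Male']
```

Note that the categories of a string column are sorted, so 'Female' gets code 0 even though 'Male' appears first.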

Upvotes: 12

Ted Petrou

Reputation: 62037

If you are using sklearn, I would suggest sticking with the methods in that library that do this for you. Sklearn has a number of ways of preprocessing data, such as encoding labels. One of them is the sklearn.preprocessing.LabelEncoder class.

from sklearn.preprocessing import LabelEncoder  

le = LabelEncoder()
le.fit_transform(y_train)

This outputs:

array([0, 1, 0, 2])

Use le.inverse_transform([0, 1, 2]) to map the integer codes back to the original labels.
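As a rough end-to-end sketch of the round trip (the four-element y_train is assumed from the question, and the variable names codes and decoded are mine):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample labels assumed from the question.
y_train = np.array(['A', 'B', 'A', 'C'])

le = LabelEncoder()
codes = le.fit_transform(y_train)  # classes are sorted, then numbered from 0

print(codes)        # [0 1 0 2]
print(le.classes_)  # ['A' 'B' 'C']

# Decode integer codes (e.g. model predictions) back to string labels.
decoded = le.inverse_transform(codes)
print(decoded)      # ['A' 'B' 'A' 'C']
```

Keep the fitted le object around: le.classes_ holds the code-to-label mapping, which you need to interpret whatever the classifier predicts later.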

Upvotes: 16

MaxU - stand with Ukraine

Reputation: 210982

Try the pd.factorize method:

In [264]: y_train = pd.Series(['A', 'B', 'A', 'C'])

In [265]: y_train
Out[265]:
0    A
1    B
2    A
3    C
dtype: object

In [266]: pd.factorize(y_train)
Out[266]: (array([0, 1, 0, 2], dtype=int64), Index(['A', 'B', 'C'], dtype='object'))

To get codes that start at 1, as in the question, add 1 to the result:

In [271]: fct = pd.factorize(y_train)[0]+1

In [272]: fct
Out[272]: array([1, 2, 1, 3], dtype=int64)
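One thing worth knowing: by default factorize numbers the labels in order of first appearance, not alphabetically. A small sketch with a deliberately unsorted input (the sample data is made up for illustration) shows the difference and how sort=True changes it:

```python
import pandas as pd

# Made-up labels where the first value is not the alphabetically smallest.
y = pd.Series(['B', 'A', 'B', 'C'])

# Default: codes follow order of first appearance (B -> 0, A -> 1, C -> 2).
codes, uniques = pd.factorize(y)
print(codes.tolist())          # [0, 1, 0, 2]
print(list(uniques))           # ['B', 'A', 'C']

# sort=True: codes follow sorted order (A -> 0, B -> 1, C -> 2).
codes_sorted, uniques_sorted = pd.factorize(y, sort=True)
print(codes_sorted.tolist())   # [1, 0, 1, 2]
print(list(uniques_sorted))    # ['A', 'B', 'C']

# The uniques index maps codes back to the original labels.
restored = uniques.take(codes)
print(list(restored))          # ['B', 'A', 'B', 'C']
```

With sort=True the codes should match what LabelEncoder produces for the same labels, which helps if you mix the two approaches.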

Upvotes: 18
