Reputation: 1453
I'm trying to turn a column of strings into integer identifiers, and I cannot find an elegant way of doing this in pandas (or Python). In the following example, I transform "A", a column of strings, into numbers through a mapping, but it looks like a dirty hack to me:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['homer_simpson', 'mean_street', 'homer_simpson', 'bla_bla'], 'B': 4})
unique = df['A'].unique()
mapping = dict(zip(unique, np.arange(len(unique))))
new_df = df.replace({'A': mapping})
Is there a better, more direct way of achieving this?
Upvotes: 4
Views: 284
Reputation: 11973
How about using factorize?
>>> labels, uniques = df.A.factorize()
>>> df.A = labels
>>> df
   A  B
0  0  4
1  1  4
2  0  4
3  2  4
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.factorize.html
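If you also need to get back from the integers to the original strings, the second return value of factorize can serve as a lookup table. The following is a minimal sketch that assumes the same df as in the question; the column name A_id is hypothetical and only used so the original strings stay visible.

```python
import pandas as pd

df = pd.DataFrame({'A': ['homer_simpson', 'mean_street', 'homer_simpson', 'bla_bla'], 'B': 4})

# factorize returns the integer codes and the unique values,
# both in order of first appearance
labels, uniques = pd.factorize(df['A'])

# 'A_id' is a hypothetical column name chosen for this sketch
df['A_id'] = labels

# indexing the uniques with the codes recovers the original strings
restored = uniques.take(labels)
```

Keeping uniques around is what makes the encoding reversible; assigning the labels directly over df.A, as above, discards the strings unless you store uniques somewhere.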
Upvotes: 5
Reputation: 109726
A simple map with a dictionary that goes from each unique value to its position should get you what you want. The values returned by unique() are all distinct, so using them as dictionary keys won't result in duplicate keys.
df['A'] = df.A.map({val: n for n, val in enumerate(df['A'].unique())})
>>> df
   A  B
0  0  4
1  1  4
2  0  4
3  2  4
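One possible reason to build the dictionary explicitly is that it can be reused on other data. The sketch below assumes the same df as in the question; the names mapping, A_id, new and new_ids are hypothetical and only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['homer_simpson', 'mean_street', 'homer_simpson', 'bla_bla'], 'B': 4})

# value -> integer id, numbered in order of first appearance
mapping = {val: n for n, val in enumerate(df['A'].unique())}

# 'A_id' is a hypothetical column name chosen for this sketch
df['A_id'] = df['A'].map(mapping)

# hypothetical new data: values that are not keys of the mapping become NaN
new = pd.Series(['mean_street', 'unknown_name'])
new_ids = new.map(mapping)
```

Because unseen values turn into NaN, new_ids ends up as a float column here, so check for missing values before casting it back to int.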
Upvotes: 1
Reputation: 353499
Assuming you don't care much about what the integers are, simply that there's a consistent mapping, you could (1) use the Categorical codes or (2) rank the values:
>>> df["A_categ"] = pd.Categorical(df.A).codes
>>> df["A_rank"] = df["A"].rank(method="dense").astype(int)
>>> df
               A  B  A_categ  A_rank
0  homer_simpson  4        1       2
1    mean_street  4        2       3
2  homer_simpson  4        1       2
3        bla_bla  4        0       1
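Note that these integers follow the sorted order of the strings, not the order in which they first appear, which is why they differ from what factorize gives by default. A small sketch of the difference, assuming the same values as in the question (the variable names are hypothetical):

```python
import pandas as pd

s = pd.Series(['homer_simpson', 'mean_street', 'homer_simpson', 'bla_bla'])

# Categorical sorts its categories, so codes follow lexical order
cat_codes = pd.Categorical(s).codes

# factorize numbers values in order of first appearance by default
first_seen_codes, _ = pd.factorize(s)

# with sort=True, factorize numbers values in sorted order instead
sorted_codes, _ = pd.factorize(s, sort=True)
```

Here cat_codes and sorted_codes agree, while first_seen_codes assigns 0 to homer_simpson because it comes first in the column. If you only need a consistent mapping, either ordering works.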
Upvotes: 1