Reputation: 331
My dataframe looks like this:
col1 col2 col3
Aba xxx yyy
bab bhh jjj
ccc kkk lll
Aba xxx yyy
ccc kkk jjj
I want to replace unique values of each column with something like:
In col1
: Aba
gets replaced with a0
,bab
gets replaced with a1
, ccc
gets replaced with a2
whereever it appears in the column.
Similarly with col2
: xxx
gets replaced with b0
,bhh
gets replaced with b1
etc.
In short first column starts replacing unique values with a0,a1,a2,a3
2nd one with b0,b1,b2,b3
, third column with c0,c1,c2
.. etc .
Looking for a solution in loop so that I can process all the columns and all the unique values all together coz I have more than a million rows.
Thanks,
Upvotes: 0
Views: 2245
Reputation: 164623
Here is a numpy
solution. It should be efficient as list comprehension often faster than apply
+ lambda
.
Source for alphabet range: Alphabet range python
import pandas as pd
from string import ascii_lowercase
df = pd.DataFrame({'col1': {0: 'Aba', 1: 'bab', 2: 'ccc', 3: 'Aba', 4: 'ccc'},
'col2': {0: 'xxx', 1: 'bhh', 2: 'kkk', 3: 'xxx', 4: 'kkk'},
'col3': {0: 'yyy', 1: 'jjj', 2: 'lll', 3: 'yyy', 4: 'jjj'}})
a = df.values
f = np.array([np.unique(a[:, i], return_inverse=True)[1] for i in range(a.shape[1])]).T
res = list(ascii_lowercase[:a.shape[1]]) + \
pd.DataFrame(f.astype(str), columns=df.columns)
# col1 col2 col3
# 0 a0 b2 c2
# 1 a1 b0 c0
# 2 a2 b1 c1
# 3 a0 b2 c2
# 4 a2 b1 c0
Upvotes: 0
Reputation: 210832
I'd do it this way:
In [184]: ['a','b','c'] + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
Out[184]:
col1 col2 col3
0 a0 b0 c0
1 a1 b1 c1
2 a2 b2 c2
3 a0 b0 c0
4 a2 b2 c1
a bit more generic approach:
import string
c = list(string.ascii_lowercase)[:len(df.columns)]
df1 = c + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
df1
col1 col2 col3
0 a0 b0 c0
1 a1 b1 c1
2 a2 b2 c2
3 a0 b0 c0
4 a2 b2 c1
Upvotes: 3