CODEWITHSUNDEEP

pythonpandasdataframe

Reputation: 331

Pandas dataframe replace unique values in a column

My dataframe looks like this:

col1  col2  col3
Aba   xxx   yyy
bab   bhh   jjj
ccc   kkk   lll
Aba   xxx   yyy
ccc   kkk   jjj

I want to replace unique values of each column with something like: In col1: Aba gets replaced with a0,bab gets replaced with a1, ccc gets replaced with a2 whereever it appears in the column.

Similarly with col2: xxx gets replaced with b0,bhh gets replaced with b1 etc.

In short first column starts replacing unique values with a0,a1,a2,a3 2nd one with b0,b1,b2,b3, third column with c0,c1,c2.. etc .

Looking for a solution in loop so that I can process all the columns and all the unique values all together coz I have more than a million rows.

Thanks,

Upvotes: 0

Views: 2245

Answers (2)

jpp

Reputation: 164623

Here is a numpy solution. It should be efficient as list comprehension often faster than apply + lambda.

Source for alphabet range: Alphabet range python

import pandas as pd
from string import ascii_lowercase

df = pd.DataFrame({'col1': {0: 'Aba', 1: 'bab', 2: 'ccc', 3: 'Aba', 4: 'ccc'},
                   'col2': {0: 'xxx', 1: 'bhh', 2: 'kkk', 3: 'xxx', 4: 'kkk'},
                   'col3': {0: 'yyy', 1: 'jjj', 2: 'lll', 3: 'yyy', 4: 'jjj'}})

a = df.values
f = np.array([np.unique(a[:, i], return_inverse=True)[1] for i in range(a.shape[1])]).T

res = list(ascii_lowercase[:a.shape[1]]) + \
      pd.DataFrame(f.astype(str), columns=df.columns)

#   col1 col2 col3
# 0   a0   b2   c2
# 1   a1   b0   c0
# 2   a2   b1   c1
# 3   a0   b2   c2
# 4   a2   b1   c0

Upvotes: 0

MaxU - stand with Ukraine

MaxU - stand with Ukraine

Reputation: 210832

I'd do it this way:

In [184]: ['a','b','c'] + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
Out[184]:
  col1 col2 col3
0   a0   b0   c0
1   a1   b1   c1
2   a2   b2   c2
3   a0   b0   c0
4   a2   b2   c1

a bit more generic approach:

import string

c = list(string.ascii_lowercase)[:len(df.columns)]
df1 = c + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
df1
  col1 col2 col3
0   a0   b0   c0
1   a1   b1   c1
2   a2   b2   c2
3   a0   b0   c0
4   a2   b2   c1

Upvotes: 3

Related Questions