emax
emax

Reputation: 7245

Assign unique id to columns pandas data frame

Hello I have the following dataframe

df = 
A      B   
John   Tom
Homer  Bart
Tom    Maggie
Lisa   John 

I would like to assign to each name a unique ID and returns

df = 
A      B         C    D

John   Tom       0    1
Homer  Bart      2    3
Tom    Maggie    1    4 
Lisa   John      5    0

What I have done is the following:

LL1 = pd.concat([df.a,df.b],ignore_index=True)
LL1 = pd.DataFrame(LL1)
LL1.columns=['a']
nameun = pd.unique(LL1.a.ravel())
LLout['c'] = 0
LLout['d'] = 0
NN = list(nameun)
for i in range(1,len(LLout)):
   LLout.c[i] = NN.index(LLout.a[i])
   LLout.d[i] = NN.index(LLout.b[i])

But since I have a very large dataset this process is very slow.

Upvotes: 5

Views: 8722

Answers (3)

Contango
Contango

Reputation: 80232

Method: Column ID Map

This method is blazingly fast and scales to hundreds of millions of rows on a single core. Tested on Python 3.10 + Windows 10 x64 and should work on all versions of Python/Linux.

# Get the unique values in the column.
dates_unique = np.unique(df["date"].values)
# Create a dictionary which assigns each unique value to an incrementing ID.
date_to_id_dict = dict(zip(dates_unique, np.arange(0, len(dates_unique))))
# Add a new column which maps each unique value to its unique ID.
df["date_id"] = tMaster_lite["date"].map(date_to_id_dict)

Appendix A: Worked example

import numpy as np
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({"date": [Timestamp("2023.01.01"), Timestamp("2023.01.01"), Timestamp("2023.01.02")]})
# Then run code above.

Output:

        date  date_id
0 2023-01-01        0
1 2023-01-01        0
2 2023-01-02        1

Appendix B: Extra for Experts

If we want a unique ID across multiple columns, we can combine unique values across many columns into one array, prior to converting it to a dictionary:

dates_unique = np.unique(np.concatenate([np.unique(df["date"].values), np.unique(df["date_ex"].values)]))

If we want to ensure that the dates are sorted prior to getting their unique ID:

dates_unique  = np.sort(dates_unique)

Upvotes: 0

DSM
DSM

Reputation: 353019

(Note: I'm assuming you don't care about the precise details of the mapping -- which number John becomes, for example -- but only that there is one.)

Method #1: you could use a Categorical object as an intermediary:

>>> ranked = pd.Categorical(df.stack()).codes.reshape(df.shape)
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
       A       B  C  D
0   John     Tom  2  5
1  Homer    Bart  1  0
2    Tom  Maggie  5  4
3   Lisa    John  3  2

It feels like you should be able to treat a Categorical as providing an encoding dictionary somehow (whether directly or by generating a Series) but I can't see a convenient way to do it.

Method #2: you could use rank("dense"), which generates an increasing number for each value in order:

>>> ranked = df.stack().rank("dense").reshape(df.shape).astype(int)-1
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
       A       B  C  D
0   John     Tom  2  5
1  Homer    Bart  1  0
2    Tom  Maggie  5  4
3   Lisa    John  3  2

Upvotes: 3

Andy Hayden
Andy Hayden

Reputation: 375445

Here's one way. First get the array of unique names:

In [11]: df.values.ravel()
Out[11]: array(['John', 'Tom', 'Homer', 'Bart', 'Tom', 'Maggie', 'Lisa', 'John'], dtype=object)

In [12]: pd.unique(df.values.ravel())
Out[12]: array(['John', 'Tom', 'Homer', 'Bart', 'Maggie', 'Lisa'], dtype=object)

and make this a Series, mapping names to their respective numbers:

In [13]: names = pd.unique(df.values.ravel())

In [14]: names = pd.Series(np.arange(len(names)), names)

In [15]: names
Out[15]:
John      0
Tom       1
Homer     2
Bart      3
Maggie    4
Lisa      5
dtype: int64

Now use applymap and names.get to lookup these numbers:

In [16]: df.applymap(names.get)
Out[16]:
   A  B
0  0  1
1  2  3
2  1  4
3  5  0

and assign it to the correct columns:

In [17]: df[["C", "D"]] = df.applymap(names.get)

In [18]: df
Out[18]:
       A       B  C  D
0   John     Tom  0  1
1  Homer    Bart  2  3
2    Tom  Maggie  1  4
3   Lisa    John  5  0

Note: This assumes that all the values are names to begin with, you may want to restrict this to some columns only:

df[['A', 'B']].values.ravel()
...
df[['A', 'B']].applymap(names.get)

Upvotes: 6

Related Questions