eagerstudent
eagerstudent

Reputation: 277

How to create a unique identifier based on multiple columns?

I have a pandas dataframe that looks something like this:

    brand       description     former_price    discounted_price
0   A           icecream        1099.0          855.0   
1   A           cheese          469.0           375.0   
2   B           catfood         179.0           119.0   
3   C           NaN             699.0           399.0   
4   NaN         icecream        769.0           549.0
5   A           icecream        769.0           669.0   

I want to create a column that will assign a unique value for each brand & description combination. Note that either the brand or the description can be missing from the dataset (notified by NaN value). Also, note that if the brand and the description is the same (duplicated) I still want the unique value to be the same for the row.

The output should look like this:

    product_key   brand         description     former_price    discounted_price
0   1             A             icecream        1099.0          855.0   
1   2             A             cheese          469.0           375.0   
2   3             B             catfood         179.0           119.0   
3   4             C             NaN             699.0           399.0   
4   5             NaN           icecream        769.0           549.0
5   1             A             icecream        769.0           669.0   

The values in product_key can be anything, I just want them to be unique based on brand and description columns. Any help is immensely appreciated!

Thanks a lot!

Upvotes: 8

Views: 5674

Answers (2)

MrNobody33
MrNobody33

Reputation: 6483

You could try with pd.Series.factorize:

df.set_index(['brand','description']).index.factorize()[0]+1

Output:

0    1
1    2
2    3
3    4
4    5
5    1

So you could try this, to assign it to be the first column:

df.insert(loc=0, column='product_key', value=df.set_index(['brand','description']).index.factorize()[0]+1)

Output:

df
   product_key brand description  former_price  discounted_price
0            1     A    icecream        1099.0             855.0
1            2     A      cheese         469.0             375.0
2            3     B     catfood         179.0             119.0
3            4     C         NaN         699.0             399.0
4            5   NaN    icecream         769.0             549.0
5            1     A    icecream         769.0             669.0

Upvotes: 10

anky
anky

Reputation: 75080

with groupby+ngroup:

(df.fillna({'brand':'','description':''})
   .groupby(['brand','description'],sort=False).ngroup()+1)

0    1
1    2
2    3
3    4
4    5
5    1

Upvotes: 4

Related Questions