Reputation: 1282

Pandas - Giving all rows (particularly) duplicate rows a unique identifier

Let's say I have a DF with 5 columns and I want to make a unique 'key' for each row.

   a  b  c  d  e
1  1  2  3  4  5
2  1  2  3  4  6
3  1  2  3  4  7 
4  1  2  2  5  6
5  2  3  4  5  6
6  2  3  4  5  6
7  3  4  5  6  7

I'd like to create a 'key' column as follows:

   a  b  c  d  e  key
1  1  2  3  4  5  12345
2  1  2  3  4  6  12346
3  1  2  3  4  7  12347
4  1  2  2  5  6  12256
5  2  3  4  5  6  23456
6  2  3  4  5  6  23456
7  3  4  5  6  7  34567

Now the problem with this of course is that row 5 & 6 are duplicates.

I'd like to be able to create unique keys like so:

   a  b  c  d  e  key
1  1  2  3  4  5  12345_1
2  1  2  3  4  6  12346_1
3  1  2  3  4  7  12347_1
4  1  2  2  5  6  12256_1
5  2  3  4  5  6  23456_1
6  2  3  4  5  6  23456_2
7  3  4  5  6  7  34567_1

Not sure how to do this or if this is the best method - appreciate any help.

Thanks

Edit: Columns will be mostly strings, not numeric.

Upvotes: 1

Answers (4)

Mohamed Thasin ah

Reputation: 11192

try this.,

df['key']=df.apply(lambda x:'-'.join(x.values.tolist()),axis=1)
m=~df['key'].duplicated()
s= (df.groupby(m.cumsum()).cumcount()+1).astype(str)
df['key']=df['key']+'_'+s
print (df)

O/P:

   a  b  c  d  e          key
0  1  2  3  4  5  1-2-3-4-5_0
1  1  2  3  4  6  1-2-3-4-6_0
2  1  2  3  4  7  1-2-3-4-7_0
3  1  2  2  5  6  1-2-2-5-6_0
4  2  3  4  5  6  2-3-4-5-6_0
5  2  3  4  5  6  2-3-4-5-6_1
6  3  4  5  6  7  3-4-5-6-7_0
7  1  2  3  4  5  1-2-3-4-5_1

Another much simpler way:

df['key']=df['key']+'_'+(df.groupby('key').cumcount()).astype(str)

Explanation:

first create your unique id using join.
create a sequence s using duplicate and perform cumsum, restart when new value found.
finally concat key and your sequence s.

Upvotes: 2

James Dellinger

Reputation: 1261

Another approach would be to use np.random.choice(range(10000,99999), len(df), replace=False) to generate unique random numbers without replacement for each row in your df:

df = pd.DataFrame(columns = ['a', 'b', 'c', 'd', 'e'],
                  data = [[1,  2,  3,  4,  5],[1,  2,  3,  4,  6],[1,  2,  3,  4,  7],[1,  2,  2,  5,  6],[2,  3,  4,  5,  6],[2,  3,  4,  5,  6],[3,  4,  5,  6,  7]])

df['key'] = np.random.choice(range(10000,99999), len(df), replace=False)

df

    a   b   c   d   e   key
0   1   2   3   4   5   10560
1   1   2   3   4   6   79547
2   1   2   3   4   7   24762
3   1   2   2   5   6   95221
4   2   3   4   5   6   79460
5   2   3   4   5   6   62820
6   3   4   5   6   7   82964

Upvotes: 0

Andy Hayden

Reputation: 375925

On way is to hash to tuple of each row:

In [11]: df.apply(lambda x: hash(tuple(x)), axis=1)
Out[11]:
1   -2898633648302616629
2   -2898619338595901633
3   -2898621714079554433
4   -9151203046966584651
5    1657626630271466437
6    1657626630271466437
7    3771657657075408722
dtype: int64

In [12]: df['key'] = df.apply(lambda x: hash(tuple(x)), axis=1)

In [13]: df['key'].astype(str) + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Out[13]:
1    -2898633648302616629_1
2    -2898619338595901633_1
3    -2898621714079554433_1
4    -9151203046966584651_1
5     1657626630271466437_1
6     1657626630271466437_2
7     3771657657075408722_1
dtype: object

Note: Generally you don't need to be doing this (it's unclear why you'd want to!).

Upvotes: 4

Jeril

Reputation: 8571

Maybe you can do something link the following

import uuid
df['uuid'] = [uuid.uuid4() for __ in range(df.index.size)]

Upvotes: 0

Pandas - Giving all rows (particularly) duplicate rows a unique identifier

Answers (4)

Related Questions