EduinJBL

Reputation: 11

(pandas) How can I create a unique identifier based on three similar columns of data, where order doesn't matter?

(Python/Pandas) I'm doing some analysis on UK House Price data, looking at whether house prices respond to the quality of nearby schools. I have matched the URN (Unique Reference Number) of the three nearest schools to each house price transaction. These are columns URN_1, URN_2, URN_3 in the data.

I would like to estimate a fixed effects model on the data, where the fixed effects are based on the three nearest schools. I therefore want to create a unique ID for each cluster of three schools, and I want this to be unaffected by the order of the schools, e.g. Property A and Property B below should have the same ID, despite the different order of the schools.

Property    URN_1   URN_2   URN_3
A         100053   100052   100054
B         100052   100054   100053

Does anyone know how I can create unique cluster IDs like this using Python?

I've tried using .groupby() to create the ID with the code below, but this gives different cluster IDs when the order of the schools is different.

Here is what I have tried:

import pandas as pd

URN1 = [1, 2, 3, 4, 5]
URN2 = [5, 4, 3, 2, 1]
URN3 = [1, 2, 3, 2, 1]
lst = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(list(zip(URN1, URN2, URN3)),
                  columns=['URN_1', 'URN_2', 'URN_3'], index=lst)
df['clusterid'] = df.groupby(['URN_1', 'URN_2', 'URN_3']).ngroup()
print(df)

I'd want observations 'a' and 'e' to have the same cluster ID, but they are given different IDs by this method.

Upvotes: 1

Views: 483

Answers (3)

harpan

Reputation: 8631

Use factorize on a string key built from each combination. Since the order does not matter, sort the three URNs in each row first, then join them into one string.

df['clusterid'] = pd.factorize(df[['URN_1','URN_2','URN_3']].apply(lambda x: ','.join(str(y) for y in sorted(x)), axis=1))[0]

Output:

   URN_1  URN_2  URN_3  clusterid
a      1      5      1          0
b      2      4      2          1
c      3      3      3          2
d      4      2      2          1
e      5      1      1          0
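The dense one-liner above can be unpacked into smaller steps. Here is a sketch of the same factorize idea, using sample data mirroring the question's attempt:

```python
import pandas as pd

df = pd.DataFrame({'URN_1': [1, 2, 3, 4, 5],
                   'URN_2': [5, 4, 3, 2, 1],
                   'URN_3': [1, 2, 3, 2, 1]},
                  index=list('abcde'))

# build an order-insensitive key: sort the three URNs in each row,
# then join them into a single string
keys = df[['URN_1', 'URN_2', 'URN_3']].apply(
    lambda row: ','.join(str(v) for v in sorted(row)), axis=1)

# factorize assigns an integer to each distinct key,
# in order of first appearance
df['clusterid'] = pd.factorize(keys)[0]
print(df['clusterid'].tolist())  # [0, 1, 2, 1, 0]
```

Rows 'a' and 'e' both reduce to the key "1,1,5", so they share cluster ID 0.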

Upvotes: 0

Quang Hoang

Reputation: 150735

This works if your data is not too long:

# we sort the values of each row
# and turn them to tuples
# we sort the values of each row
# and turn them into tuples
markers = (df[['URN_1','URN_2','URN_3']]
             .apply(lambda x: tuple(sorted(x.values)), axis=1)
          )

df['clusterid'] = df.groupby(markers).ngroup()

Output:

  Property   URN_1   URN_2   URN_3  clusterid
0        A  100053  100052  100054          0
1        B  100052  100054  100053          0

Option 2: the solution above uses apply, which can be slow. Here's a little math trick: a triple (a, b, c) is uniquely determined (up to permutation) by (a+b+c, a**2+b**2+c**2, a*b*c), since these values fix the elementary symmetric polynomials, and hence the cubic whose roots are a, b, c. So we can compute those three values and group by them:

tmp_df = df[['URN_1','URN_2','URN_3']]

s = tmp_df.sum(1)         # sums
sq = (tmp_df**2).sum(1)   # sum of squares
p = tmp_df.prod(1)        # products

# groupby
df['clusterid'] = df.groupby([s, sq, p]).ngroup()

Performance: the first approach takes 14s to process 2 million rows, while the second takes less than 1 second.
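For completeness, another way to avoid apply is to sort each row in one vectorized call. This is a sketch of that variant, assuming the three URN columns are numeric; the sample data mirrors the question:

```python
import numpy as np
import pandas as pd

# sample data mirroring the question's attempt
df = pd.DataFrame({'URN_1': [1, 2, 3, 4, 5],
                   'URN_2': [5, 4, 3, 2, 1],
                   'URN_3': [1, 2, 3, 2, 1]},
                  index=list('abcde'))

# np.sort with axis=1 sorts each row independently,
# so the original column order of the schools no longer matters
sorted_urns = pd.DataFrame(
    np.sort(df[['URN_1', 'URN_2', 'URN_3']].to_numpy(), axis=1),
    index=df.index)

# group on the three sorted columns to get an order-insensitive id
df['clusterid'] = df.groupby(
    [sorted_urns[c] for c in sorted_urns.columns]).ngroup()
print(df['clusterid'].tolist())  # [0, 1, 2, 1, 0]
```

Like Option 2, this stays fully vectorized, while keeping the sorted-row logic explicit.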

Upvotes: 1

jtremoureux

Reputation: 28

You can create a string for each row from the three URNs, sorted.
Then group by this new variable and use ngroup() as you tried before:

df['URN_join'] = df[['URN_1','URN_2','URN_3']].apply(lambda x: '_'.join([str(nb) for nb in sorted(x)]), axis=1)
df['clusterid'] = df.groupby(['URN_join']).ngroup()
df

Output:

    URN_1   URN_2   URN_3   clusterid   URN_join
a   1       5       1       0           1_1_5
b   2       4       2       1           2_2_4
c   3       3       3       2           3_3_3
d   4       2       2       1           2_2_4
e   5       1       1       0           1_1_5

Upvotes: 0
