Reputation: 11
(Python/Pandas) I'm doing some analysis on UK house price data, looking at whether house prices respond to the quality of nearby schools. I have matched the URN (Unique Reference Number) of the three nearest schools to each house price transaction. These are columns URN_1, URN_2, URN_3 in the data.
I would like to estimate a fixed effects model on the data, where the fixed effects are based on the three nearest schools. I therefore want to create a unique ID for each cluster of three schools, and I want this ID to be unaffected by the order of the schools, e.g. Property A and Property B below should have the same ID, despite the different order of the schools.
Property URN_1 URN_2 URN_3
A 100053 100052 100054
B 100052 100054 100053
Does anyone know how I create unique cluster ids using Python?
I've tried using .groupby() to create the ID with the code below, but this gives different cluster IDs when the order of the schools differs.
Here is what I have tried:
import pandas as pd

URN1 = [1, 2, 3, 4, 5]
URN2 = [5, 4, 3, 2, 1]
URN3 = [1, 2, 3, 2, 1]
lst = ['a', 'b', 'c', 'd', 'e']

df = pd.DataFrame(list(zip(URN1, URN2, URN3)),
                  columns=['URN_1', 'URN_2', 'URN_3'], index=lst)
df['clusterid'] = df.groupby(['URN_1', 'URN_2', 'URN_3']).ngroup()
print(df)
I'd want to have observations 'a' and 'e' have the same cluster id, but they are given different ids by this method.
Upvotes: 1
Views: 483
Reputation: 8631
Use factorize on a string key built from each row's URNs. Since the order does not matter, sort the three values first, then join them:
df['clusterid'] = pd.factorize(
    df[['URN_1', 'URN_2', 'URN_3']]
      .apply(lambda x: ','.join(str(y) for y in sorted(x)), axis=1)
)[0]
Output:
   URN_1  URN_2  URN_3  clusterid
a      1      5      1          0
b      2      4      2          1
c      3      3      3          2
d      4      2      2          1
e      5      1      1          0
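For illustration (my own sketch, not part of the original answer), here is the same idea applied to the Property A/B rows from the question — both rows end up with the same cluster ID because the sorted key is identical:

```python
import pandas as pd

# Rebuild the question's two-row example
df = pd.DataFrame({'URN_1': [100053, 100052],
                   'URN_2': [100052, 100054],
                   'URN_3': [100054, 100053]},
                  index=['A', 'B'])

# Sort each row's URNs before joining, so order no longer matters
key = df[['URN_1', 'URN_2', 'URN_3']].apply(
    lambda x: ','.join(str(y) for y in sorted(x)), axis=1)
df['clusterid'] = pd.factorize(key)[0]
print(df)  # both rows get clusterid 0
```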
Upvotes: 0
Reputation: 150735
This works if your data is not too long:
# sort the values of each row and turn them into tuples
markers = (df[['URN_1', 'URN_2', 'URN_3']]
             .apply(lambda x: tuple(sorted(x.values)), axis=1))
df['clusterid'] = df.groupby(markers).ngroup()
Output:
  Property   URN_1   URN_2   URN_3  clusterid
0        A  100053  100052  100054          0
1        B  100052  100054  100053          0
Option 2: the solution above uses apply, which might not be ideal in some cases. Here's a little math trick: a group (a, b, c) is uniquely defined (up to a permutation) by (a+b+c, a**2+b**2+c**2, a*b*c), since these three values determine the elementary symmetric polynomials, and hence the roots of the cubic they define. So we can compute those values and group by them:
tmp_df = df[['URN_1', 'URN_2', 'URN_3']]
s = tmp_df.sum(axis=1)         # sums
sq = (tmp_df**2).sum(axis=1)   # sums of squares
p = tmp_df.prod(axis=1)        # products

# groupby
df['clusterid'] = df.groupby([s, sq, p]).ngroup()
Performance: the first approach takes about 14 seconds to process 2 million rows, while the second takes less than 1 second.
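As a quick sanity check (my own sketch, not part of the original answer), one can confirm on random data that grouping by (sum, sum of squares, product) produces the same partition as grouping by the sorted tuples:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(1000, 3)),
                  columns=['URN_1', 'URN_2', 'URN_3'])

# Option 1: group by sorted tuples
markers = df.apply(lambda x: tuple(sorted(x.values)), axis=1)
g1 = df.groupby(markers).ngroup()

# Option 2: group by the symmetric functions
s = df.sum(axis=1)
sq = (df**2).sum(axis=1)
p = df.prod(axis=1)
g2 = df.groupby([s, sq, p]).ngroup()

# The labels may be numbered differently, but the partitions coincide:
# same number of groups, and each g1 group maps to exactly one g2 group.
assert g1.nunique() == g2.nunique()
assert pd.DataFrame({'g1': g1, 'g2': g2}).groupby('g1')['g2'].nunique().eq(1).all()
```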
Upvotes: 1
Reputation: 28
You can create a string key for each row from the three URNs, sorted. Then group by this new column and use ngroup(), as you tried before:
df['URN_join'] = df[['URN_1', 'URN_2', 'URN_3']].apply(
    lambda x: '_'.join(str(nb) for nb in sorted(x)), axis=1)
df['clusterid'] = df.groupby(['URN_join']).ngroup()
df
Output:
   URN_1  URN_2  URN_3 URN_join  clusterid
a      1      5      1    1_1_5          0
b      2      4      2    2_2_4          1
c      3      3      3    3_3_3          2
d      4      2      2    2_2_4          1
e      5      1      1    1_1_5          0
Upvotes: 0