Reputation: 3345
I have a dataset that looks like this:
postcode house_number col2 col3
xxx xxx xxx xxx
xxx xxx xxx xxx
I want to group the data by postcode
and house_number
, if two rows have the same postcode and house_number, it means they are the same property, then I want to construct a unique_id
for each property (in other words, for a unique_id
, the postcode
/ house_number
must be the same, but the value for col2
/ col3
might be different), something like:
unique_id postcode house_number col2 col3
0 111 222 xxx xxx
0 111 222 xxx xxx
1 xxx xxx xxx xxx
.....
I tried new_df = ppd_df.groupby(['postcode','house_number']).reset_index()
but it gave me error AttributeError: 'DataFrameGroupBy' object has no attribute 'reset_index'
, also I'm not sure how to construct the column unique_id
. Can someone help please? Thanks.
Upvotes: 0
Views: 186
Reputation: 35626
Use groupby ngroup
to "[n]umber each group from 0 to the number of groups - 1":
df['unique_id'] = df.groupby(['postcode', 'house_number']).ngroup()
df
:
postcode house_number col2 col3 unique_id
0 111 222 1 5 0
1 111 222 2 6 0
2 111 444 3 7 1
3 333 333 4 8 2
insert
can be used to make it the first column if needed:
df.insert(0, 'unique_id', df.groupby(['postcode', 'house_number']).ngroup())
df
:
unique_id postcode house_number col2 col3
0 0 111 222 1 5
1 0 111 222 2 6
2 1 111 444 3 7
3 2 333 333 4 8
*Note: sort=False
can be used to ensure that groups are enumerated in the order in which they appear in the DataFrame:
df['unique_id'] = df.groupby(['postcode', 'house_number'], sort=False).ngroup()
Or
df.insert(0, 'unique_id',
df.groupby(['postcode', 'house_number'], sort=False).ngroup())
Otherwise groups will be enumerated in the "order in which the groups would be seen when iterating over the groupby object, not the order they are first observed."
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
'postcode': [111, 111, 111, 333],
'house_number': [222, 222, 444, 333],
'col2': [1, 2, 3, 4],
'col3': [5, 6, 7, 8],
})
Upvotes: 1
Reputation: 120391
Create tuple from postcode
and house_number
columns and use pd.factorize
to get a unique identifier:
df['unique_id'] = \
pd.factorize(df[['postcode', 'house_number']].apply(tuple, axis=1))[0]
>>> df
postcode house_number col2 col3 unique_id
0 111 222 xxx xxx 0
1 111 333 xxx xxx 1
2 111 222 xxx xxx 0
3 111 222 xxx xxx 0
4 222 444 xxx xxx 2
Upvotes: 0