Reputation: 2520
I have a dataset of stations
map_id longitude latitude zip_code
0 40830 -87.669147 41.857908 60608
1 40830 -87.669147 41.857908 60608
2 40120 -87.680622 41.829353 60609
3 40120 -87.680622 41.829353 60609
4 41120 -87.625826 41.831677 60616
As you can see, rows 0–1 and 2–3 are duplicates, and that is not an accident: they are the same physical stations, treated as separate stations because they belong to different lines.
I would like to collapse such duplicates (there can be 2 or even 5 rows for some stations) into a single row per station.
Moreover, I would like to create a new column, "hub", where the aggregated rows are marked as hub stations, e.g. as a boolean flag (0 for a regular station, 1 for a hub).
The desired output for the sample above (two cases of duplication) is 3 rows, 2 of which are hubs:
map_id longitude latitude zip_code hub
0 40830 -87.669147 41.857908 60608 1
1 40120 -87.680622 41.829353 60609 1
2 41120 -87.625826 41.831677 60616 0
I appreciate any tips!
Upvotes: 1
Views: 90
Reputation: 26676
It looks like you want to drop the duplicates and then flag certain zip codes. If so, I would drop duplicates and use np.where to assign the hub column. I included a non-existent zip code (60619) to demonstrate how to handle the case where more than one zip code falls on the non-hub side of the condition:
import numpy as np

# Keep one row per station; .copy() avoids a SettingWithCopyWarning
# when a new column is added to the result below.
df2 = df.drop_duplicates(subset=['map_id', 'longitude', 'latitude', 'zip_code'], keep='first').copy()

# Stations whose zip code is in this list are regular (hub = 0);
# all others are hubs (hub = 1). Assumes zip_code is stored as strings;
# 60619 does not occur in the sample.
conditions = df2['zip_code'].isin(['60616', '60619'])
df2['hub'] = np.where(conditions, 0, 1)
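For reference, here is the snippet run end to end on a small reconstruction of the sample data (with zip codes stored as strings, which is an assumption on my part):

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample: two duplicated stations and one unique one.
df = pd.DataFrame({
    'map_id': [40830, 40830, 40120, 40120, 41120],
    'longitude': [-87.669147, -87.669147, -87.680622, -87.680622, -87.625826],
    'latitude': [41.857908, 41.857908, 41.829353, 41.829353, 41.831677],
    'zip_code': ['60608', '60608', '60609', '60609', '60616'],
})

# Drop the duplicated station rows, keeping the first of each group.
df2 = df.drop_duplicates(subset=['map_id', 'longitude', 'latitude', 'zip_code'], keep='first').copy()

# Zip codes on this list become regular stations (0), all others hubs (1).
conditions = df2['zip_code'].isin(['60616', '60619'])
df2['hub'] = np.where(conditions, 0, 1)

print(df2[['map_id', 'zip_code', 'hub']])
```

This produces three rows with hub values 1, 1, 0, matching the desired output. Note that the hub list is hard-coded here, so it has to be kept in sync with the data.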
Upvotes: 2
Reputation: 16147
# Count how many rows each station has; the count becomes the 'hub' column
df = df.groupby(['map_id','longitude','latitude','zip_code']).size().reset_index(name='hub')
# A count of 1 means a regular station (0); any count above 1 means a hub (1)
df['hub'] = df['hub'].replace(1, 0).apply(lambda x: min(x, 1))
Output
map_id longitude latitude zip_code hub
0 40120 -87.680622 41.829353 60609 1
1 40830 -87.669147 41.857908 60608 1
2 41120 -87.625826 41.831677 60616 0
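A related sketch (my own variation, not part of the answer above): duplicated with keep=False marks every row of a repeated station, so the hub flag can be derived directly without counting, and the original row order is preserved:

```python
import pandas as pd

# Reconstruction of the sample data (zip codes as strings is an assumption).
df = pd.DataFrame({
    'map_id': [40830, 40830, 40120, 40120, 41120],
    'longitude': [-87.669147, -87.669147, -87.680622, -87.680622, -87.625826],
    'latitude': [41.857908, 41.857908, 41.829353, 41.829353, 41.831677],
    'zip_code': ['60608', '60608', '60609', '60609', '60616'],
})

# keep=False flags every member of a duplicate group, so any station
# appearing more than once gets hub = 1 on all of its rows.
df['hub'] = df.duplicated(subset=['map_id'], keep=False).astype(int)

# Then keep a single row per station.
out = df.drop_duplicates(subset=['map_id']).reset_index(drop=True)
```

Unlike the groupby approach, this keeps the stations in their original order rather than sorted by the grouping keys.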
Upvotes: 1