Anakin Skywalker
Anakin Skywalker

Reputation: 2520

Identify duplicate rows with pandas and convert such rows into one row, creating a new columns as a result

I have a dataset of stations

    map_id  longitude   latitude  zip_code
0   40830 -87.669147  41.857908     60608
1   40830 -87.669147  41.857908     60608
2   40120 -87.680622  41.829353     60609
3   40120 -87.680622  41.829353     60609
4   41120 -87.625826  41.831677     60616

As you can see, the first four rows are duplications and it is not an accident. They are the same stations, which are treated as separate stations of different lines.

I would like to eliminate such duplicates (it can be 2 or even 5 rows for some stations) and treat it as one station.

Moreover, I would like to create a new column "Hub", where aggregated rows will be treated a hub station. For example, as a boolean (0 for a regular station, 1 for a hub).

The desired output for the sample above with two cases of duplication -> transformed into 3 rows with 2 hubs.

    map_id  longitude   latitude  zip_code    hub
0   40830 -87.669147  41.857908     60608      1
1   40120 -87.680622  41.829353     60609      1
1   41120 -87.625826  41.831677     60616      0

I appreciate any tips!

Upvotes: 1

Views: 90

Answers (2)

wwnde
wwnde

Reputation: 26676

Looks to me like you want to drop duplicates and assign certain zipcodes as hub. If so, I would drop duplicates and use np.where to assign hubs. I included a non existent opcode to demonstrate how you can do this if more than one zipcode is designated as a hub

import numpy as np



df2=df.drop_duplicates(subset=['map_id','longitude','latitude','zip_code'], keep='first')
conditions=df2['zip_code'].isin(['60616','60619'])
df2['hub']=np.where(conditions,0,1)

enter image description here

Upvotes: 2

Chris
Chris

Reputation: 16147

df = df.groupby(['map_id','longitude','latitude','zip_code']).size().reset_index(name='hub')

df['hub'] = df['hub'].replace(1,0).apply(lambda x:min(x,1))

Output

   map_id  longitude   latitude  zip_code  hub
0   40120 -87.680622  41.829353     60609    1
1   40830 -87.669147  41.857908     60608    1
2   41120 -87.625826  41.831677     60616    0

Upvotes: 1

Related Questions