Reputation: 2520
I have a dataset of stations
map_id longitude latitude zip_code
0 40830 -87.669147 41.857908 60608
1 40830 -87.669147 41.857908 60608
2 40120 -87.680622 41.829353 60609
3 40120 -87.680622 41.829353 60609
4 41120 -87.625826 41.831677 60616
As you can see, rows 0–1 and 2–3 are duplicates, and that is not an accident: they are the same physical stations, treated as separate stations because they belong to different lines.
I would like to collapse such duplicates (there can be 2 or even 5 rows for some stations) into a single row per station.
Moreover, I would like to create a new column, "hub", where the aggregated rows are marked as hub stations, e.g. as a boolean flag (0 for a regular station, 1 for a hub).
The desired output for the sample above (two cases of duplication) is 3 rows, 2 of which are hubs:
map_id longitude latitude zip_code hub
0 40830 -87.669147 41.857908 60608 1
1 40120 -87.680622 41.829353 60609 1
2 41120 -87.625826 41.831677 60616 0
I appreciate any tips!
Upvotes: 1
Views: 90
Reputation: 26676
It looks like you want to drop the duplicates and then flag certain zip codes. If so, I would drop duplicates and use np.where to assign the hub column. I included a non-existent zip code (60619) to demonstrate how to handle the case where more than one zip code falls on the non-hub side of the condition:
import numpy as np

# Keep one row per station; .copy() avoids a SettingWithCopyWarning
# when a new column is added to the result below.
df2 = df.drop_duplicates(subset=['map_id', 'longitude', 'latitude', 'zip_code'], keep='first').copy()

# Stations whose zip code is in this list are regular (hub = 0);
# all others are hubs (hub = 1). Assumes zip_code is stored as strings;
# 60619 does not occur in the sample.
conditions = df2['zip_code'].isin(['60616', '60619'])
df2['hub'] = np.where(conditions, 0, 1)
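For reference, here is the snippet run end to end on a small reconstruction of the sample data (with zip codes stored as strings, which is an assumption on my part):

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample: two duplicated stations and one unique one.
df = pd.DataFrame({
    'map_id': [40830, 40830, 40120, 40120, 41120],
    'longitude': [-87.669147, -87.669147, -87.680622, -87.680622, -87.625826],
    'latitude': [41.857908, 41.857908, 41.829353, 41.829353, 41.831677],
    'zip_code': ['60608', '60608', '60609', '60609', '60616'],
})

# Drop the duplicated station rows, keeping the first of each group.
df2 = df.drop_duplicates(subset=['map_id', 'longitude', 'latitude', 'zip_code'], keep='first').copy()

# Zip codes on this list become regular stations (0), all others hubs (1).
conditions = df2['zip_code'].isin(['60616', '60619'])
df2['hub'] = np.where(conditions, 0, 1)

print(df2[['map_id', 'zip_code', 'hub']])
```

This produces three rows with hub values 1, 1, 0, matching the desired output. Note that the hub list is hard-coded here, so it has to be kept in sync with the data.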
Upvotes: 2
Reputation: 16147
# Count how many rows each station has; the count becomes the 'hub' column
df = df.groupby(['map_id','longitude','latitude','zip_code']).size().reset_index(name='hub')
# A count of 1 means a regular station (0); any count above 1 means a hub (1)
df['hub'] = df['hub'].replace(1, 0).apply(lambda x: min(x, 1))
Output
map_id longitude latitude zip_code hub
0 40120 -87.680622 41.829353 60609 1
1 40830 -87.669147 41.857908 60608 1
2 41120 -87.625826 41.831677 60616 0
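A related sketch (my own variation, not part of the answer above): duplicated with keep=False marks every row of a repeated station, so the hub flag can be derived directly without counting, and the original row order is preserved:

```python
import pandas as pd

# Reconstruction of the sample data (zip codes as strings is an assumption).
df = pd.DataFrame({
    'map_id': [40830, 40830, 40120, 40120, 41120],
    'longitude': [-87.669147, -87.669147, -87.680622, -87.680622, -87.625826],
    'latitude': [41.857908, 41.857908, 41.829353, 41.829353, 41.831677],
    'zip_code': ['60608', '60608', '60609', '60609', '60616'],
})

# keep=False flags every member of a duplicate group, so any station
# appearing more than once gets hub = 1 on all of its rows.
df['hub'] = df.duplicated(subset=['map_id'], keep=False).astype(int)

# Then keep a single row per station.
out = df.drop_duplicates(subset=['map_id']).reset_index(drop=True)
```

Unlike the groupby approach, this keeps the stations in their original order rather than sorted by the grouping keys.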
Upvotes: 1