How to identify duplicates and return duplicates in a new column in python using pandas

Question

I have a large dataframe that has location names tagged to GPS coordinates. However, multiple locations are tagged onto one GPS coordinates, and that should not be the case. I would like to create a new column in this dataframe that lists out the other locations tagged onto the GPS coordinates.

I know this is probably overkill, but currently, I'm using a cKDTree to return the other locations tagged onto the GPS. However, I'm only getting half the picture, and I'm not sure where the hiccups are coming from.

df_inventory['has_duplicates']= df_inventory.duplicated(subset=['lat,'lng'], keep=False)

for x in inventory_points:
    inventory_index = inventory_tree.query_ball_point(x, 0)

    if len(inventory_index)>1:
        distance, index_points = inventory_tree.query(x, len(inventory_index))

        for y in index_points:
            df_inventory.duplicates.iloc[inventory_points.index(x)] = np.append(df_inventory.duplicates.iloc[inventory_points.index(x)], df_inventory.location.iloc[y])
            df_inventory.duplicates.iloc[inventory_points.index(x)] = np.unique(df_inventory.duplicates.iloc[inventory_points.index(x)])

This is what I expect to see.

+----------+-----+-----+----------------+------------+
| location | lat | lng | has_duplicates | duplicates |
+----------+-----+-----+----------------+------------+
| loc1     | 1   | 2   | T              | loc1, loc2 |
+----------+-----+-----+----------------+------------+
| loc2     | 1   | 2   | T              | loc1, loc2 |
+----------+-----+-----+----------------+------------+
| loc3     | 2   | 1   | T              | loc3, loc5 |
+----------+-----+-----+----------------+------------+
| loc4     | 3   | 4   | F              | None       |
+----------+-----+-----+----------------+------------+
| loc5     | 2   | 1   | T              | loc3, loc5|
+----------+-----+-----+----------------+------------+
| loc6     | 5   | 4   | F              | None       |
+----------+-----+-----+----------------+------------+

However, this is what I'm getting.

+----------+-----+-----+----------------+------------+
| location | lat | lng | has_duplicates | duplicates |
+----------+-----+-----+----------------+------------+
| loc1     | 1   | 2   | T              | loc1, loc2 |
+----------+-----+-----+----------------+------------+
| loc2     | 1   | 2   | T              |            |
+----------+-----+-----+----------------+------------+
| loc3     | 2   | 1   | T              | loc3, loc5 |
+----------+-----+-----+----------------+------------+
| loc4     | 3   | 4   | F              | None       |
+----------+-----+-----+----------------+------------+
| loc5     | 2   | 1   | T              |            |
+----------+-----+-----+----------------+------------+
| loc6     | 5   | 4   | F              | None       |
+----------+-----+-----+----------------+------------+

loc2 should have loc1, loc2 listed as duplicates; and loc5 should have loc3, loc5 listed as duplicates, but it doesn't.

UPDATE

index() will just return the first occurrence of an item in a list and doesn't work well with lists that contain duplicates. I ended up enumerating the list in another list and then iterated through the enumerated list to reference the desired list. It's still a pretty clunky work around, but if anyone has a better way to go through it, please let me know!

Terry · Accepted Answer

IIUC you can try:

df_inventory['duplicates'] = df_inventory.groupby(['lat','lng'])['location'] \
                              .transform(lambda x: ', '.join(x) if len(x) > 1 else None)

output:


location    lat lng duplicates
0   loc1    1   2   loc1, loc2
1   loc2    1   2   loc1, loc2
2   loc3    2   1   loc3, loc5
3   loc4    3   4   NaN
4   loc5    2   1   loc3, loc5
5   loc6    5   4   NaN

How to identify duplicates and return duplicates in a new column in python using pandas

Answers (2)

Related Questions