Reputation: 5
I have a large dataframe that has location names tagged to GPS coordinates. However, multiple locations are tagged onto one GPS coordinates, and that should not be the case. I would like to create a new column in this dataframe that lists out the other locations tagged onto the GPS coordinates.
I know this is probably overkill, but currently, I'm using a cKDTree to return the other locations tagged onto the GPS. However, I'm only getting half the picture, and I'm not sure where the hiccups are coming from.
df_inventory['has_duplicates']= df_inventory.duplicated(subset=['lat,'lng'], keep=False)
for x in inventory_points:
inventory_index = inventory_tree.query_ball_point(x, 0)
if len(inventory_index)>1:
distance, index_points = inventory_tree.query(x, len(inventory_index))
for y in index_points:
df_inventory.duplicates.iloc[inventory_points.index(x)] = np.append(df_inventory.duplicates.iloc[inventory_points.index(x)], df_inventory.location.iloc[y])
df_inventory.duplicates.iloc[inventory_points.index(x)] = np.unique(df_inventory.duplicates.iloc[inventory_points.index(x)])
This is what I expect to see.
+----------+-----+-----+----------------+------------+
| location | lat | lng | has_duplicates | duplicates |
+----------+-----+-----+----------------+------------+
| loc1 | 1 | 2 | T | loc1, loc2 |
+----------+-----+-----+----------------+------------+
| loc2 | 1 | 2 | T | loc1, loc2 |
+----------+-----+-----+----------------+------------+
| loc3 | 2 | 1 | T | loc3, loc5 |
+----------+-----+-----+----------------+------------+
| loc4 | 3 | 4 | F | None |
+----------+-----+-----+----------------+------------+
| loc5 | 2 | 1 | T | loc3, loc5|
+----------+-----+-----+----------------+------------+
| loc6 | 5 | 4 | F | None |
+----------+-----+-----+----------------+------------+
However, this is what I'm getting.
+----------+-----+-----+----------------+------------+
| location | lat | lng | has_duplicates | duplicates |
+----------+-----+-----+----------------+------------+
| loc1 | 1 | 2 | T | loc1, loc2 |
+----------+-----+-----+----------------+------------+
| loc2 | 1 | 2 | T | |
+----------+-----+-----+----------------+------------+
| loc3 | 2 | 1 | T | loc3, loc5 |
+----------+-----+-----+----------------+------------+
| loc4 | 3 | 4 | F | None |
+----------+-----+-----+----------------+------------+
| loc5 | 2 | 1 | T | |
+----------+-----+-----+----------------+------------+
| loc6 | 5 | 4 | F | None |
+----------+-----+-----+----------------+------------+
loc2 should have loc1, loc2 listed as duplicates; and loc5 should have loc3, loc5 listed as duplicates, but it doesn't.
UPDATE
index() will just return the first occurrence of an item in a list and doesn't work well with lists that contain duplicates. I ended up enumerating the list in another list and then iterated through the enumerated list to reference the desired list. It's still a pretty clunky work around, but if anyone has a better way to go through it, please let me know!
Upvotes: 0
Views: 154
Reputation: 59
Does this answer your question ??
df_inventory = pd.DataFrame({
'location':['loc1','loc2','loc3','loc4','loc5','loc6'],
'lat':[1,1,2,3,2,5],
'lng':[2,2,1,4,1,4],
})
df_inventory['duplicates '] = df_inventory.groupby(['lat','lng'])['location'].transform(lambda x: (x+' ').sum() if len(x) > 1 else None)
output:
location lat lng duplicates
0 loc1 1 2 loc1 loc2
1 loc2 1 2 loc1 loc2
2 loc3 2 1 loc3 loc5
3 loc4 3 4 NaN
4 loc5 2 1 loc3 loc5
5 loc6 5 4 NaN
Sorry, I dont know how to you the result properly ... somebody may help me
Upvotes: 0
Reputation: 2811
IIUC you can try:
df_inventory['duplicates'] = df_inventory.groupby(['lat','lng'])['location'] \
.transform(lambda x: ', '.join(x) if len(x) > 1 else None)
output:
location lat lng duplicates
0 loc1 1 2 loc1, loc2
1 loc2 1 2 loc1, loc2
2 loc3 2 1 loc3, loc5
3 loc4 3 4 NaN
4 loc5 2 1 loc3, loc5
5 loc6 5 4 NaN
Upvotes: 1