Reputation: 1717
How to get all the existing duplicated sets of records(based on a column) from a dataframe?
I got a dataframe as follows:
flight_id | from_location | to_location | schedule |
1 | Vancouver | Toronto | 3-Jan |
2 | Amsterdam | Tokyo | 15-Feb |
4 | Fairbanks | Glasgow | 12-Jan |
9 | Halmstad | Athens | 21-Jan |
3 | Brisbane | Lisbon | 4-Feb |
4 | Johannesburg | Venice | 23-Jan |
9 | LosAngeles | Perth | 3-Mar |
Here flight_id is the column on which I need to check duplicates. And there are 2 sets of duplicates.
Output for this specific example should look like--[(2,5),(3,6)]
. List of tuples of record index values
Upvotes: 11
Views: 2370
Reputation: 294278
Using apply
and a lambda
df.groupby('flight_id').apply(
lambda d: tuple(d.index) if len(d.index) > 1 else None
).dropna()
flight_id
4 (2, 5)
9 (3, 6)
dtype: object
Or better with an iteration through the groupby
object
{k: tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1}
{4: (2, 5), 9: (3, 6)}
Just the tuples
[tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1]
[(2, 5), (3, 6)]
Leaving this for posterity
But I now highly dislike this approach. It's just too gross.
I was messing around with itertools.groupby
Others may find this fun
from itertools import groupby
key = df.flight_id.get
s = sorted(df.index, key=key)
dict(filter(
lambda t: len(t[1]) > 1,
((k, tuple(g)) for k, g in groupby(s, key))
))
{4: (2, 5), 9: (3, 6)}
Upvotes: 8
Reputation: 323266
Is this what you need ? duplicated
+groupby
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple)
Out[510]:
flight_id
4 (2, 5)
9 (3, 6)
Name: index, dtype: object
Adding tolist
at the end
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple).tolist()
Out[511]: [(2, 5), (3, 6)]
And another solution ... for fun only
s=df['flight_id'].value_counts()
list(map(lambda x : tuple(df[df['flight_id']==x].index.tolist()), s[s.gt(1)].index))
Out[519]: [(2, 5), (3, 6)]
Upvotes: 9
Reputation: 402523
Performing a groupby
on df.index
can take you places.
v = df.index.to_series().groupby(df.flight_id).apply(pd.Series.tolist)
v[v.str.len().gt(1)]
flight_id
4 [2, 5]
9 [3, 6]
dtype: object
You can also get cute with just groupby
on df.index
directly.
v = pd.Series(df.index.groupby(df.flight_id))
v[v.str.len().gt(1)].to_dict()
{
"4": [
2,
5
],
"9": [
3,
6
]
}
Upvotes: 6