Reputation: 317
I have a dataframe with monthly records for different IDs, and I need to do some analysis only on the IDs that have multiple months of records.
ID Month Metric1 Metric2
1 2018-01-01 4 3
1 2018-02-01 3 2
2 2018-02-01 1 5
3 2018-01-01 4 2
3 2018-02-01 6 3
4 2018-01-01 3 1
How would I filter out the rows of ID that only appear once and keep those with multiple rows and get a result like
ID Month Metric1 Metric2
1 2018-01-01 4 3
1 2018-02-01 3 2
3 2018-01-01 4 2
3 2018-02-01 6 3
I've looked at some other pages that mention using something like:
df = df[df.groupby('ID').ID.transform(len) > 1]
But I don't want to lose the metrics from each month by grouping.
Upvotes: 12
Views: 4411
Reputation: 294358
filter
I cannot vouche for the speed of this but this is what this API was intended for...
df.groupby('ID').filter(lambda d: len(d) > 1)
ID Month Metric1 Metric2
0 1 2018-01-01 4 3
1 1 2018-02-01 3 2
3 3 2018-01-01 4 2
4 3 2018-02-01 6 3
Numpy'd version of @Wen-Ben's answer
u, i = np.unique(df.ID.values, return_inverse=True)
df[np.bincount(i)[i] > 1]
ID Month Metric1 Metric2
0 1 2018-01-01 4 3
1 1 2018-02-01 3 2
3 3 2018-01-01 4 2
4 3 2018-02-01 6 3
Because I was curious...
s0 = set()
s1 = set()
for i in df.ID:
if i in s0:
s1.add(i)
s0.add(i)
df[df.ID.map(s1.__contains__)]
ID Month Metric1 Metric2
0 1 2018-01-01 4 3
1 1 2018-02-01 3 2
3 3 2018-01-01 4 2
4 3 2018-02-01 6 3
Upvotes: 3
Reputation: 323306
Change the len
to count
df[df.groupby('ID').ID.transform('count') > 1]
Out[589]:
ID Month Metric1 Metric2
0 1 2018-01-01 4 3
1 1 2018-02-01 3 2
3 3 2018-01-01 4 2
4 3 2018-02-01 6 3
Upvotes: 11
Reputation: 75080
Try with pd.series.duplicated()
:
df1=df[df.ID.duplicated(keep=False)]
print(df1)
ID Month Metric1 Metric2
0 1 2018-01-01 4 3
1 1 2018-02-01 3 2
3 3 2018-01-01 4 2
4 3 2018-02-01 6 3
Upvotes: 6