Stergios
Stergios

Reputation: 3196

Keeping rows in Pandas where the same ID appears more than n times and convert to list per ID

I have a DataFrame like below:

     VID  value
1     1    xx
2     2    xx1
3     2    xx2
4     2    xx3
5     2    xx4
6     3    xx
7     3    xx
8     3    xx
9     4    zz1
10    4    zz2
11    4    zz3
12    4    zz4
13    4    zz5

I want to keep only the rows whose VID exists in more than n=3 cases. In the above example, I'd keep only rows 2-5 and 9-13 (since only the VIDs 2 and 4 appear more than 3 times). The 'value' field is irrelevant.

After keeping the needed rows, I'd like to convert my data as a list of lists while appending an 'end' value at the end of each one:

[[xx1, xx2, xx3, xx4, 'end'], [zz1, zz2, zz3, zz4, zz5, 'end']]

Is it possible to get the above without a for loop?

Upvotes: 6

Views: 2917

Answers (1)

Anand S Kumar
Anand S Kumar

Reputation: 90979

You can group by VID column and then take count of each group. Then after that use it for indexing your original df, so as to only take rows that have greater than 3 count. Example -

countdf = df.groupby('VID').count()
result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]

Demo -

In [49]: df
Out[49]:
    VID value
1     1    xx
2     2   xx1
3     2   xx2
4     2   xx3
5     2   xx4
6     3    xx
7     3    xx
8     3    xx
9     4   zz1
10    4   zz2
11    4   zz3
12    4   zz4
13    4   zz5

In [51]: df.groupby('VID').count()
Out[51]:
     value
VID
1        1
2        4
3        3
4        5

In [52]: countdf = df.groupby('VID').count()

In [53]: df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]
Out[53]:
    VID value
2     2   xx1
3     2   xx2
4     2   xx3
5     2   xx4
9     4   zz1
10    4   zz2
11    4   zz3
12    4   zz4
13    4   zz5

Then after this, you can again group based on VID and then convert the groups into list and then again to list , Example -

resultlist = result.groupby('VID')['value'].apply(list).tolist()

Demo -

In [54]: result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]

In [55]: result.groupby('VID')['value'].apply(list).tolist()
Out[55]: [['xx1', 'xx2', 'xx3', 'xx4'], ['zz1', 'zz2', 'zz3', 'zz4', 'zz5']]

Please note, above would not get you the 'end' value in the list, I am guessing that is not necessary, but if you really want that, you can manually add it after getting the list. Example -

resultlist = [elem + ['end'] for elem in resultlist]

Upvotes: 6

Related Questions