Reputation: 154
I have the following dataframe "data" composed of ID and associated cluster number :
ID cluster
FP_101 1
FP_102 1
SP_209 3
SP_300 3
SP_209 1
FP_45 90
SP_50 90
FP_398 100
...
I would like to print clusters which contain more than one ID starting by SP and/or FP. I think that I have the two parts of the answer but just do not know of to combine them in propre way :
The result should give from the previous example :
ID cluster
FP_101 1
FP_102 1
SP_209 1
SP_209 3
SP_300 3
How can I combine arrange these fonction to obtain this result ?
Upvotes: 0
Views: 128
Reputation: 153
This is my understanding of your question; let me know if it helps:
df['Prefix'] = df['ID'].apply(lambda x: x.split('_')[0])
df2 = df.groupby(['cluster', 'Prefix'], as_index = False).agg({'ID':['nunique','unique']})
df2.columns = df2.columns.to_flat_index().str.join('')
df2[df2['IDnunique']>1]
Upvotes: 1