How to select rows values starting by specific letters by group in a python dataframe?

Question

I have the following dataframe "data" composed of ID and associated cluster number :

ID      cluster 
FP_101   1  
FP_102   1     
SP_209   3
SP_300   3
SP_209   1
FP_45    90
SP_50    90
FP_398   100
...

I would like to print clusters which contain more than one ID starting by SP and/or FP. I think that I have the two parts of the answer but just do not know of to combine them in propre way :

data = data[data['ID'].str.startswith('FP')] (same for SP)
selection fonction : data = data.groupby(['cluster']).filter(lambda x: x['ID'].nunique() > 1)

The result should give from the previous example :

    ID      cluster 
    FP_101   1  
    FP_102   1
    SP_209   1     
    SP_209   3
    SP_300   3

How can I combine arrange these fonction to obtain this result ?

Megha John · Accepted Answer

This is my understanding of your question; let me know if it helps:

Separating SP & FP

df['Prefix'] = df['ID'].apply(lambda x: x.split('_')[0])

Grouping by clusters

df2 = df.groupby(['cluster', 'Prefix'], as_index = False).agg({'ID':['nunique','unique']})

Filtering

df2.columns = df2.columns.to_flat_index().str.join('')

df2[df2['IDnunique']>1]

How to select rows values starting by specific letters by group in a python dataframe?

Answers (1)

Related Questions