Rahul Agarwal
Rahul Agarwal

Reputation: 4100

Groupby on 2 columns plus filter on String columns

Sample DF:

ID   Name        Price     Condition   Fit_Test
1    Apple         10      Good        Super_Fit
2    Apple         10      OK          Super_Fit
3    Apple         10      Bad         Super_Fit
4    Orange        12      Good        Not_Fit
5    Orange        12      OK          Not_Fit
6    Banana        15      OK          Medium_Fit
7    Banana        15      Bad         Medium_Fit
8    Pineapple     25      OK          Medium_Fit
9    Pineapple     25      OK          Medium_Fit
10   Cherry        30      Bad         Medium_Fit

Expected DF:

ID   Name        Price     Condition   Fit_Test
1    Apple         10      Good        Super_Fit
2    Apple         10      OK          Super_Fit
3    Apple         10      Bad         Super_Fit

4    Orange        12      Good        Not_Fit

6    Banana        15      OK          Medium_Fit

8    Pineapple     25      OK          Medium_Fit
9    Pineapple     25      OK          Medium_Fit
10   Cherry        30      Bad         Medium_Fit

Problem Statement:

I want to group-by by Name and Price and then filter based on Condition.

  1. If within a Name and Price All 3 conditions of Good, Bad and OK exist then Keep only Good one and Fit_Test is not Super_Fit

  2. If within a Name and Price conditions of Good and OK exist then Keep only Good one (Id 4,5 is only ID 4 in expected) and Fit_Test is not Super_Fit

  3. If within a Name and Price conditions of Bad and OK exist then Keep only OK one (Id 6,7 is only ID 6 in expected) and Fit_Test is not Super_Fit

  4. If within a Name and Price conditions of OK and OK exist, Good and Good Exist or just Bad exist then don't do anything then Keep only OK one (Id 8,9,10 is ID 8,9,10 in expected) and Fit_Test is not Super_Fit

Update on Answer(s)

  1. The 1st answer and edit for testing works for all the df where you don't have a condition of Fit_Test column. In this answer the Expected DF will not have rows 2 & 3 as also shown in the answer
  2. The Edit for Update answer works when you need to take another columns which is Fit_Test and should only work when the value is not Super_Fit.

In both the solutions the Filtering of rows based on Condition column and group by of 2 columns is same.

I found things with filter + group by on the numeric columns but none on the String columns.

Upvotes: 3

Views: 82

Answers (2)

jezrael
jezrael

Reputation: 863226

Idea is create sets for comparing:

a = df.join(df.groupby(['Price','Name'])['Condition'].apply(set).rename('m'),
             on=['Price','Name'])['m']
print (a)
0    {Bad, Good, OK}
1    {Bad, Good, OK}
2    {Bad, Good, OK}
3         {Good, OK}
4         {Good, OK}
5          {Bad, OK}
6          {Bad, OK}
7               {OK}
8               {OK}
9              {Bad}
Name: m, dtype: object

m1 = (a == set({'Bad', 'Good', 'OK'})) | (a == set({'Good', 'OK'}))
m2 = a == set({'Bad', 'OK'})
#check if unique value - length of set is 1
m3 = a.str.len() == 1
m4 = df['Condition'] == 'Good'
m5 = df['Condition'] == 'OK'

df = df[(m1 & m4) | (m2 & m5) | m3]
print (df)
   ID       Name  Price Condition
0   1      Apple     10      Good
3   4     Orange     12      Good
5   6     Banana     15        OK
7   8  Pineapple     25        OK
8   9  Pineapple     25        OK
9  10     Cherry     30       Bad

EDIT for testing:

For testing is possible use assign:

print (df.assign(sets=a, m1 = m1, m2=m2, m3=m3, m4=m4, m5=m5, m=m))
   ID       Name  Price Condition             sets     m1     m2     m3  \
0   1      Apple     10      Good  {Bad, Good, OK}   True  False  False   
1   2      Apple     10        OK  {Bad, Good, OK}   True  False  False   
2   3      Apple     10       Bad  {Bad, Good, OK}   True  False  False   
3   4     Orange     12      Good       {Good, OK}   True  False  False   
4   5     Orange     12        OK       {Good, OK}   True  False  False   
5   6     Banana     15        OK        {Bad, OK}  False   True  False   
6   7     Banana     15       Bad        {Bad, OK}  False   True  False   
7   8  Pineapple     25        OK             {OK}  False  False   True   
8   9  Pineapple     25        OK             {OK}  False  False   True   
9  10     Cherry     30       Bad            {Bad}  False  False   True   

      m4     m5      m  
0   True  False   True  
1  False   True  False  
2  False  False  False  
3   True  False   True  
4  False   True  False  
5  False   True   True  
6  False  False  False  
7  False   True   True  
8  False   True   True  
9  False  False   True  

EDIT for update:

For new condition use:

    m6 = df['Fit_Test'] == 'Super_Fit'
    df = df[((m1 & m4) | (m2 & m5) | m3) | m6]
    print (df)
       ID       Name  Price Condition    Fit_Test
    0   1      Apple     10      Good   Super_Fit
    1   2      Apple     10        OK   Super_Fit
    2   3      Apple     10       Bad   Super_Fit
    3   4     Orange     12      Good     Not_Fit
    5   6     Banana     15        OK  Medium_Fit
    7   8  Pineapple     25        OK  Medium_Fit
    8   9  Pineapple     25        OK  Medium_Fit
    9  10     Cherry     30       Bad  Medium_Fit

Upvotes: 2

Christian Sloper
Christian Sloper

Reputation: 7510

If you don't have too many conditions (like here it is only 3), the following is an easy workaround:

df.loc[df["Condition"] == 'Good',"Condition"] = 3
df.loc[df["Condition"] == 'OK',"Condition"] = 2
df.loc[df["Condition"] == 'Bad',"Condition"] = 1

df = df.groupby(['Name','Price']).max()

df.loc[df["Condition"] == 3] = "Good"
df.loc[df["Condition"] == 2] = "OK"
df.loc[df["Condition"] == 1] = "Bad"

Upvotes: 0

Related Questions