Reputation: 85

dataframe in pandas with certain conditions

am trying to combine features of in a dataframe to derive a new columns in the dataframe

I have this dataframe

Id   Author   News_post  Label
1    Jessica  xxxxxxxxx  1
2    Adams    xxxxxxxxx  1
3    Adams    xxxxxxxxx  1
4    Mike     xxxxxxxxx  0
5    James    xxxxxxxxx  1
6    Mike     xxxxxxxxx  1
7    Mike     xxxxxxxxx  0
8    Paul     xxxxxxxxx  0
9    Jessica  xxxxxxxxx  0
10   Adams    xxxxxxxxx  0

NB: where the Label column have 1=TRUE AND 0=FALSE

Id   Author   Num_Post  Num_True_Label  Num_False_Label   Mean
1    Adams    3         2               1                 x
2    James    1         1               0                 x
3    Jessica  2         1               1                 x
4    Mike     2         0               1                 x
5    Paul     1         0               0                 x

Upvotes: 1

Answers (5)

Ayoub ZAROU

Reputation: 2417

you could try :

agg_df = df.groupby('Author')['Label'].agg({"Num_post" : 'count', 'Num_True_Label' : 
                                             lambda x : x.eq(1).sum(), 
                                            'Num_False_Label':lambda x : 
                                            x.eq(0).sum(), 
                                            'Mean':'mean'}).reset_index()

Upvotes: 0

adhg

Reputation: 10863

Use transform and then remove the duplicates such that:

df['Num_Post']= df.groupby(['Author'])['Label'].transform('count')
df['Num_True_Label']= df.groupby(['Author'])['Label'].transform('sum')
df['Num_False_Label']= df['Num_Post']-df['Num_True_Label']
df['Mean']= df['Num_Post']/df['Num_True_Label']

Finally: drop dups and remove the News_post

df.drop(columns=['News_post'], inplace=True)
df.drop_duplicates(subset='Author', keep='first').sort_values(by=['Author'])

result:

    Id  Author      Label   Num_Post    Num_True_Label  Num_False_Label Mean
    1   2   Adams       1       3           2               1               1.500000
    4   5   James       1       1           1               0               1.000000
    0   1   Jessica     1       2           1               1               2.000000
    3   4   Mike        0       3           1               2               3.000000
    7   8   Paul        0       1           0               1               inf

Note: change the mean for your definition.

Upvotes: 0

Scott Boston

Reputation: 153460

Using Pandas 0.25 with aggregation relabeling

df.groupby('Author')['Label'].agg(Num_Post = 'size',
                                  Num_True = 'sum',
                                  Num_False = lambda x: x.eq(0).sum(),
                                  Mean = 'mean')

Output:

         Num_Post  Num_True  Num_False      Mean
Author                                          
Adams           3         2          1  0.666667
James           1         1          0  1.000000
Jessica         2         1          1  0.500000
Mike            3         1          2  0.333333
Paul            1         0          1  0.000000

Upvotes: 1

Gabriela Melo

Reputation: 619

The following will get you what you need:

In [1]: import pandas as pd                                                                                                                                                                                                                  

In [2]: df = pd.DataFrame({'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James', 'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'], 'News_post': ['xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xx
    ...: xxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx'], 'Label': [1,1,1,0,1,1,0,0,0,0]})                                                                                                                                                                

In [3]: num_true_label_df = df.groupby(by=['Author']).sum().rename(columns={'Label': 'Num_True_Label'}).reset_index()                                                                                                                        

In [4]: num_post_df = df.groupby(by=['Author']).count().rename(columns={'News_post': 'Num_Post'})[['Num_Post']].reset_index()                                                                                                                

In [5]: df = pd.merge(num_post_df, num_true_label_df, how='left', on='Author').reset_index().rename(columns={'index': 'Id'})

In [6]: df['Id'] = df['Id'] + 1

In [7]: df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']

In [8]: df                                                                                                                                                                                                                                
Out[7]: 
   Id   Author  Num_Post  Num_True_Label  Num_False_Label
0   1    Adams         3               2                1
1   2    James         1               1                0
2   3  Jessica         2               1                1
3   4     Mike         3               1                2
4   5     Paul         1               0                1

Please further specify what your Mean column should represent.

Some resources which might be helpful: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

Upvotes: 1

Sazzy

Reputation: 1994

This may solve a number of things you are trying to get from your issue:

df = pd.read_clipboard()  # just copied your dataframe
df = df.groupby('Author').describe()

Output:

           Id                                               Label                                               
        count      mean       std  min  25%  50%  75%   max count      mean       std  min   25%  50%   75%  max
Author                                                                                                          
Adams     3.0  5.000000  4.358899  2.0  2.5  3.0  6.5  10.0   3.0  0.666667  0.577350  0.0  0.50  1.0  1.00  1.0
James     1.0  5.000000       NaN  5.0  5.0  5.0  5.0   5.0   1.0  1.000000       NaN  1.0  1.00  1.0  1.00  1.0
Jessica   2.0  5.000000  5.656854  1.0  3.0  5.0  7.0   9.0   2.0  0.500000  0.707107  0.0  0.25  0.5  0.75  1.0
Mike      3.0  5.666667  1.527525  4.0  5.0  6.0  6.5   7.0   3.0  0.333333  0.577350  0.0  0.00  0.0  0.50  1.0
Paul      1.0  8.000000       NaN  8.0  8.0  8.0  8.0   8.0   1.0  0.000000       NaN  0.0  0.00  0.0  0.00  0.0

Upvotes: 1

dataframe in pandas with certain conditions

Answers (5)

Related Questions