Reputation: 85
am trying to combine features of in a dataframe to derive a new columns in the dataframe
I have this dataframe
Id Author News_post Label
1 Jessica xxxxxxxxx 1
2 Adams xxxxxxxxx 1
3 Adams xxxxxxxxx 1
4 Mike xxxxxxxxx 0
5 James xxxxxxxxx 1
6 Mike xxxxxxxxx 1
7 Mike xxxxxxxxx 0
8 Paul xxxxxxxxx 0
9 Jessica xxxxxxxxx 0
10 Adams xxxxxxxxx 0
NB: where the Label
column have 1=TRUE
AND 0=FALSE
Id Author Num_Post Num_True_Label Num_False_Label Mean
1 Adams 3 2 1 x
2 James 1 1 0 x
3 Jessica 2 1 1 x
4 Mike 2 0 1 x
5 Paul 1 0 0 x
Upvotes: 1
Views: 146
Reputation: 2417
you could try :
agg_df = df.groupby('Author')['Label'].agg({"Num_post" : 'count', 'Num_True_Label' :
lambda x : x.eq(1).sum(),
'Num_False_Label':lambda x :
x.eq(0).sum(),
'Mean':'mean'}).reset_index()
Upvotes: 0
Reputation: 10863
Use transform and then remove the duplicates such that:
df['Num_Post']= df.groupby(['Author'])['Label'].transform('count')
df['Num_True_Label']= df.groupby(['Author'])['Label'].transform('sum')
df['Num_False_Label']= df['Num_Post']-df['Num_True_Label']
df['Mean']= df['Num_Post']/df['Num_True_Label']
Finally: drop dups and remove the News_post
df.drop(columns=['News_post'], inplace=True)
df.drop_duplicates(subset='Author', keep='first').sort_values(by=['Author'])
result:
Id Author Label Num_Post Num_True_Label Num_False_Label Mean
1 2 Adams 1 3 2 1 1.500000
4 5 James 1 1 1 0 1.000000
0 1 Jessica 1 2 1 1 2.000000
3 4 Mike 0 3 1 2 3.000000
7 8 Paul 0 1 0 1 inf
Note: change the mean for your definition.
Upvotes: 0
Reputation: 153460
Using Pandas 0.25 with aggregation relabeling
df.groupby('Author')['Label'].agg(Num_Post = 'size',
Num_True = 'sum',
Num_False = lambda x: x.eq(0).sum(),
Mean = 'mean')
Output:
Num_Post Num_True Num_False Mean
Author
Adams 3 2 1 0.666667
James 1 1 0 1.000000
Jessica 2 1 1 0.500000
Mike 3 1 2 0.333333
Paul 1 0 1 0.000000
Upvotes: 1
Reputation: 619
The following will get you what you need:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James', 'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'], 'News_post': ['xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xx
...: xxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx'], 'Label': [1,1,1,0,1,1,0,0,0,0]})
In [3]: num_true_label_df = df.groupby(by=['Author']).sum().rename(columns={'Label': 'Num_True_Label'}).reset_index()
In [4]: num_post_df = df.groupby(by=['Author']).count().rename(columns={'News_post': 'Num_Post'})[['Num_Post']].reset_index()
In [5]: df = pd.merge(num_post_df, num_true_label_df, how='left', on='Author').reset_index().rename(columns={'index': 'Id'})
In [6]: df['Id'] = df['Id'] + 1
In [7]: df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']
In [8]: df
Out[7]:
Id Author Num_Post Num_True_Label Num_False_Label
0 1 Adams 3 2 1
1 2 James 1 1 0
2 3 Jessica 2 1 1
3 4 Mike 3 1 2
4 5 Paul 1 0 1
Please further specify what your Mean
column should represent.
Some resources which might be helpful: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
Upvotes: 1
Reputation: 1994
This may solve a number of things you are trying to get from your issue:
df = pd.read_clipboard() # just copied your dataframe
df = df.groupby('Author').describe()
Output:
Id Label
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
Author
Adams 3.0 5.000000 4.358899 2.0 2.5 3.0 6.5 10.0 3.0 0.666667 0.577350 0.0 0.50 1.0 1.00 1.0
James 1.0 5.000000 NaN 5.0 5.0 5.0 5.0 5.0 1.0 1.000000 NaN 1.0 1.00 1.0 1.00 1.0
Jessica 2.0 5.000000 5.656854 1.0 3.0 5.0 7.0 9.0 2.0 0.500000 0.707107 0.0 0.25 0.5 0.75 1.0
Mike 3.0 5.666667 1.527525 4.0 5.0 6.0 6.5 7.0 3.0 0.333333 0.577350 0.0 0.00 0.0 0.50 1.0
Paul 1.0 8.000000 NaN 8.0 8.0 8.0 8.0 8.0 1.0 0.000000 NaN 0.0 0.00 0.0 0.00 0.0
Upvotes: 1