Reputation: 8628
This is a sample data frame df:
TIME VP_1 VP_2 VP_3 EVAL
20 3242 3244 3245 0
24 3244 3244 3242 0
30 3456 3244 3456 1
33 3456 3245 3242 0
45 3242 3456 3245 1
I am calculating the average TIME per VP_* value, once for EVAL equal to 0 and once for EVAL equal to 1.
This is a sample output for VP equal to 3242:
VP EVAL AVG_TIME
3242 0 25.67
3242 1 45
The problem is that I get different results when I apply the following two snippets, which I expected to be equivalent, to my real dataset. I cannot understand why this happens or which of the two approaches is correct.
Code #1
grouped = (pd.melt(df, id_vars=['EVAL', 'TIME'], value_name='VP')
             .drop('variable', axis=1).drop_duplicates()
             .groupby(['EVAL', 'VP']).agg({'TIME' : 'mean'})
             .reset_index())
Code #2
cols = ['VP', 'TIME', 'EVAL']
grouped = pd.melt(
df, ['TIME', 'EVAL'],
['VP_1', 'VP_2', 'VP_3'],
value_name='VP')[cols]
ab = grouped.groupby(['EVAL','VP']).agg({'TIME' : 'mean'}).reset_index()
Upvotes: 1
Views: 226
Reputation: 862671
The difference is drop_duplicates. drop('variable', axis=1) is the same as [cols]: both remove the column variable. Only Code #1 then calls .drop_duplicates(), so rows 6 and 12 are removed because they are duplicates:
grouped = (pd.melt(df, id_vars=['EVAL', 'TIME'], value_name='VP')
             .drop('variable', axis=1).drop_duplicates())
print (grouped)
EVAL TIME VP
0 0 20 3242
1 0 24 3244
2 1 30 3456
3 0 33 3456
4 1 45 3242
5 0 20 3244
7 1 30 3244
8 0 33 3245
9 1 45 3456
10 0 20 3245
11 0 24 3242
13 0 33 3242
14 1 45 3245
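To see exactly which rows drop_duplicates discards, one option is to ask pandas for the duplicated rows directly. This is a minimal sketch that rebuilds the sample df from the question; the names melted and dropped are only illustrative.

```python
import pandas as pd

# Sample data frame from the question
df = pd.DataFrame({
    "TIME": [20, 24, 30, 33, 45],
    "VP_1": [3242, 3244, 3456, 3456, 3242],
    "VP_2": [3244, 3244, 3244, 3245, 3456],
    "VP_3": [3245, 3242, 3456, 3242, 3245],
    "EVAL": [0, 0, 1, 0, 1],
})

# Same melt as in Code #1, without the drop_duplicates step
melted = (pd.melt(df, id_vars=["EVAL", "TIME"], value_name="VP")
            .drop(columns="variable"))

# duplicated() marks every repeat of an earlier (EVAL, TIME, VP) row;
# these are the rows that drop_duplicates() would remove
dropped = melted[melted.duplicated()]
print(dropped)  # rows 6 and 12
```

Both rows come from an original row in which the same VP value appears in two VP_* columns (3244 at TIME 24, 3456 at TIME 30).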
cols = ['VP', 'TIME', 'EVAL']
grouped = pd.melt(df, ['TIME', 'EVAL'], ['VP_1', 'VP_2', 'VP_3'], value_name='VP')[cols]
print (grouped)
VP TIME EVAL
0 3242 20 0
1 3244 24 0
2 3456 30 1
3 3456 33 0
4 3242 45 1
5 3244 20 0
6 3244 24 0
7 3244 30 1
8 3245 33 0
9 3456 45 1
10 3245 20 0
11 3242 24 0
12 3456 30 1
13 3242 33 0
14 3245 45 1
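The effect on the averages can be checked by computing both means side by side. This is a sketch on the sample data only; the column labels code_1 and code_2 are just illustrative names for the two approaches.

```python
import pandas as pd

# Sample data frame from the question
df = pd.DataFrame({
    "TIME": [20, 24, 30, 33, 45],
    "VP_1": [3242, 3244, 3456, 3456, 3242],
    "VP_2": [3244, 3244, 3244, 3245, 3456],
    "VP_3": [3245, 3242, 3456, 3242, 3245],
    "EVAL": [0, 0, 1, 0, 1],
})

melted = (pd.melt(df, id_vars=["EVAL", "TIME"], value_name="VP")
            .drop(columns="variable"))

# Code #2: every melted row counts, including repeats
with_dups = melted.groupby(["EVAL", "VP"])["TIME"].mean()
# Code #1: identical (EVAL, TIME, VP) rows count only once
without_dups = melted.drop_duplicates().groupby(["EVAL", "VP"])["TIME"].mean()

comparison = pd.concat([with_dups, without_dups], axis=1,
                       keys=["code_2", "code_1"])
# Keep only the groups where the two approaches disagree
differs = comparison[(comparison["code_1"] - comparison["code_2"]).abs() > 1e-9]
print(differs)
```

On the sample data only the groups (EVAL 0, VP 3244) and (EVAL 1, VP 3456) differ. Which result is right depends on the intent: if a row whose VP value appears in two VP_* columns should contribute its TIME once, the drop_duplicates version fits; if it should be weighted by the number of appearances, the version without it fits. Note that drop_duplicates would also merge two different original rows that happen to share the same TIME, EVAL and VP, which may matter on a real dataset.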
Upvotes: 1