Reputation: 4545
I am running only the following three lines:
df = pd.read_hdf('data.h5')
print(df.mean())
print(df['derived_3'].mean())
The first print lists the individual means for each column, one of which is
derived_3 -5.046012e-01
The second print gives the mean of just this column and shows
-0.504715
Even allowing for the difference between scientific and plain notation, these values differ. Why is this so?
Examples Using Other Methods
Performing the same with sum() results in the following:
derived_3 -7.878262e+05
-788004.0
Again, the results differ slightly, but count() returns the same result in both cases:
derived_3 1561285
1561285
Also, the result of df.head():
id timestamp derived_0 derived_1 derived_2 derived_3 derived_4 \
0 10 0 0.370326 -0.006316 0.222831 -0.213030 0.729277
1 11 0 0.014765 -0.038064 -0.017425 0.320652 -0.034134
2 12 0 -0.010622 -0.050577 3.379575 -0.157525 -0.068550
3 25 0 NaN NaN NaN NaN NaN
4 26 0 0.176693 -0.025284 -0.057680 0.015100 0.180894
fundamental_0 fundamental_1 fundamental_2 ... technical_36 \
0 -0.335633 0.113292 1.621238 ... 0.775208
1 0.004413 0.114285 -0.210185 ... 0.025590
2 -0.155937 1.219439 -0.764516 ... 0.151881
3 0.178495 NaN -0.007262 ... 1.035936
4 0.139445 -0.125687 -0.018707 ... 0.630232
technical_37 technical_38 technical_39 technical_40 technical_41 \
0 NaN NaN NaN -0.414776 NaN
1 NaN NaN NaN -0.273607 NaN
2 NaN NaN NaN -0.175710 NaN
3 NaN NaN NaN -0.211506 NaN
4 NaN NaN NaN -0.001957 NaN
technical_42 technical_43 technical_44 y
0 NaN -2.0 NaN -0.011753
1 NaN -2.0 NaN -0.001240
2 NaN -2.0 NaN -0.020940
3 NaN -2.0 NaN -0.015959
4 NaN 0.0 NaN -0.007338
Upvotes: 2
Views: 6791
Reputation: 294258
pd.DataFrame method versus pd.Series method
In df.mean(), mean is pd.DataFrame.mean and operates on all columns as separate pd.Series. What is returned is a pd.Series in which df.columns is the new index and the means of each column are the values. In your initial example, df only has one column, so the result is a length-one series whose index is the name of that column and whose value is the mean of that column.
In df['derived_3'].mean(), mean is pd.Series.mean and df['derived_3'] is a pd.Series. The result of pd.Series.mean will be a scalar.
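To make the two return types concrete, here is a minimal sketch on a made-up two-column frame. The column names a and b and their values are purely illustrative and are not taken from your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, np.nan]})

# DataFrame.mean: one mean per column, returned as a Series
# whose index is the column names.
col_means = df.mean()

# Series.mean: a single scalar for that one column.
one_mean = df["a"].mean()

print(type(col_means).__name__)
print(list(col_means.index))
print(one_mean)
```

Both methods skip NaN by default, so the mean of b is taken over its two non-missing values.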
Display Differences
The difference in display arises because the result of df.mean() is a pd.Series, and its float format is controlled by pandas. On the other hand, df['derived_3'].mean() returns a plain scalar, and its display isn't controlled by pandas.
import numpy as np
import pandas as pd
A scalar:
np.pi
3.141592653589793
A pd.Series:
pd.Series(np.pi)
0 3.141593
dtype: float64
The same pd.Series with different formatting:
with pd.option_context('display.float_format', '{:0.15f}'.format):
    print(pd.Series(np.pi))
0 3.141592653589793
dtype: float64
Reduction
It is useful to think of these methods as either reducing the dimensionality or not, or, synonymously, as aggregations or transformations.
Reducing a pd.DataFrame results in a pd.Series
Reducing a pd.Series results in a scalar
Methods That Reduce
mean
sum
std
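As a rough illustration of reducing versus not reducing, the sketch below compares sum and std with cumsum on a made-up frame. Picking cumsum as the non-reducing example is my own choice, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Reduces: DataFrame (2-D) -> Series (1-D), one value per column.
reduced = df.sum()

# Does not reduce: DataFrame -> DataFrame of the same shape.
transformed = df.cumsum()

# Reduces: Series (1-D) -> scalar.
scalar = df["a"].std()

print(reduced.shape, transformed.shape, scalar)
```

Checking the shape of the result is a quick way to tell which kind of method you are dealing with.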
Upvotes: 4