Reputation: 4122
Is there a standard way in Python to calculate the conditional means and variances of pandas DataFrame variables? The aim is to test the data for over- or underdispersion, as a prerequisite for deciding whether a Poisson or a negative binomial model is more suitable for regression.
Looking around the R ecosystem and Cross Validated, it seems R has some packages with built-in methods for checking dispersion, but I can't find a Python equivalent in pandas, SciPy or statsmodels.
This is the head of the data I'm working with. There are 25,000 observations.
aspunet  c_#  c_++  Ruby  java
      0    0     0     0     6
     11    0     0     0     0
      0    0     7     0     0
      0    0     0     9     0
      8    0     0     0     0
      0    2     0     0     0
      0    0     0     4     0
      0    0     0     0     6
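For reference, the dispersion check described in the question comes down to comparing the conditional variance with the conditional mean. A brief statement of that comparison follows; the symbols $Y$ (count response), $X$ (conditioning variable), $\mu$ and $\alpha$ are notation introduced here, not taken from the post, and the negative binomial form shown is the common NB2 parameterization.

```latex
\begin{align*}
\text{Poisson:} \quad & \operatorname{Var}(Y \mid X) = \operatorname{E}(Y \mid X) = \mu \\
\text{Negative binomial (NB2):} \quad & \operatorname{Var}(Y \mid X) = \mu + \alpha\mu^{2}, \qquad \alpha > 0 \\
\text{Overdispersion:} \quad & \operatorname{Var}(Y \mid X) > \operatorname{E}(Y \mid X) \\
\text{Underdispersion:} \quad & \operatorname{Var}(Y \mid X) < \operatorname{E}(Y \mid X)
\end{align*}
```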
Upvotes: 4
Views: 2132
Reputation: 402
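# Group the DataFrame by each column in turn (assumes `df` is the DataFrame from the question).
conditional = [df.groupby(col_name) for col_name in df.columns]
# Mean and sample variance of the remaining columns within each group.
mean = [cond.mean() for cond in conditional]
var = [cond.var() for cond in conditional]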
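To turn the grouped statistics above into a dispersion check, one option is to put the conditional mean and variance of a single response column side by side and take their ratio. The following is a minimal sketch, not a standard library routine: the helper name `dispersion_table` is hypothetical, the eight rows are just the head shown in the question, and treating `aspunet` as the response and `java` as the conditioning variable is an assumption made purely for illustration.

```python
import pandas as pd

# First rows from the question, used here only as sample data.
df = pd.DataFrame(
    [
        [0, 0, 0, 0, 6],
        [11, 0, 0, 0, 0],
        [0, 0, 7, 0, 0],
        [0, 0, 0, 9, 0],
        [8, 0, 0, 0, 0],
        [0, 2, 0, 0, 0],
        [0, 0, 0, 4, 0],
        [0, 0, 0, 0, 6],
    ],
    columns=["aspunet", "c_#", "c_++", "Ruby", "java"],
)


def dispersion_table(frame, response, by):
    """Hypothetical helper: mean, sample variance, group size and
    variance-to-mean ratio of `response` within each level of `by`."""
    stats = frame.groupby(by)[response].agg(["mean", "var", "count"])
    # A ratio well above 1 suggests overdispersion relative to Poisson,
    # a ratio well below 1 suggests underdispersion.
    # Groups with zero mean yield NaN here.
    stats["var_to_mean"] = stats["var"] / stats["mean"]
    return stats


# Assumed roles for illustration only: `aspunet` as response, `java` as condition.
table = dispersion_table(df, response="aspunet", by="java")
print(table)
```

Note that `var()` in pandas uses the sample variance (`ddof=1`) by default, so groups containing a single observation produce `NaN`. With 25,000 observations the per-group counts should be checked before reading much into any one ratio.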
Upvotes: 4