Reputation: 117
My data roughly follow this layout:
Category | Value1 | Value2 | Value3 |
---|---|---|---|
A | 5.8 | 7.2 | 8.8 |
A | 5.7 | 6.7 | 4.5 |
B | 8.5 | 7.3 | 2.2 |
C | 5.3 | 0.4 | 4.1 |
C | 4.2 | 9.5 | 9.3 |
C | 5.9 | 7.6 | 5.3 |
D | 7.6 | 3.5 | 2.3 |
D | 6.8 | 8.8 | 6.4 |
My aim is to calculate the correlations, i.e. whether Values 1-3 are affected differently depending on the category. For example, can we say that Category A leads to a higher Value 1 than the other categories? What is the best and shortest way to achieve this in Python?
Upvotes: 0
Views: 209
Reputation: 13821
I am not fully sure how you want to approach this. But given your question, you can compare the Value columns across categories in a 'short' way using a grouped mean:
df.groupby('Category').mean()
Value1 Value2 Value3
Category
A 5.750000 6.950000 6.650000
B 8.500000 7.300000 2.200000
C 5.133333 5.833333 6.233333
D 7.200000 6.150000 4.350000
This shows that, contrary to your example, Category A does not lead to a higher Value 1: its mean (5.75) is lower than that of B (8.5) and D (7.2), and only C (5.13) is lower.
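For anyone who wants to reproduce the table, here is a minimal, self-contained sketch. It assumes the data is typed in by hand from the question; the names `means` and `ranking` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Category": list("AABCCCDD"),
    "Value1": [5.8, 5.7, 8.5, 5.3, 4.2, 5.9, 7.6, 6.8],
    "Value2": [7.2, 6.7, 7.3, 0.4, 9.5, 7.6, 3.5, 8.8],
    "Value3": [8.8, 4.5, 2.2, 4.1, 9.3, 5.3, 2.3, 6.4],
})

# Mean of every Value column within each category
means = df.groupby("Category").mean()

# Categories ordered by their mean Value1, highest first
ranking = means["Value1"].sort_values(ascending=False)

print(means)
print(ranking)
```

Sorting the `Value1` column makes the comparison explicit: B and D come out above A, and C below it. With only one to three rows per category, these means are descriptive only.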
You can also calculate the relative change for each category, moving from each Value column to the next. The result is a fraction, not a percentage, so 0.208696 means an increase of about 20.9%:
df.groupby('Category').mean().pct_change(axis=1).fillna(0)
Value1 Value2 Value3
Category
A 0.0 0.208696 -0.043165
B 0.0 -0.141176 -0.698630
C 0.0 0.136364 0.068571
D 0.0 -0.145833 -0.292683
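To make it clearer what `pct_change(axis=1)` computes, the sketch below rebuilds the same numbers by hand: each column of means is divided by the column to its left, minus one. The data is retyped from the question, and `manual` and `builtin` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Category": list("AABCCCDD"),
    "Value1": [5.8, 5.7, 8.5, 5.3, 4.2, 5.9, 7.6, 6.8],
    "Value2": [7.2, 6.7, 7.3, 0.4, 9.5, 7.6, 3.5, 8.8],
    "Value3": [8.8, 4.5, 2.2, 4.1, 9.3, 5.3, 2.3, 6.4],
})
means = df.groupby("Category").mean()

# Built-in version: change from each column to the next one
builtin = means.pct_change(axis=1).fillna(0)

# Manual version: current column / previous column - 1
# shift(axis=1) moves every column one position to the right,
# so Value1 has no predecessor and becomes NaN (filled with 0 here)
manual = (means / means.shift(axis=1) - 1).fillna(0)

print(manual)
```

For Category A, for example, the Value2 entry is 6.95 / 5.75 - 1, which is roughly 0.2087. The first column is always 0 only because of `fillna(0)`; it has no earlier column to compare against.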
To get p-values, you can fit a very simple linear regression. There are many sources online that will help you here. In its simplest form:
from statsmodels.formula.api import ols
fit = ols('Value1 ~ C(Category)', data=df).fit()
# fit.summary()  # uncomment for the full regression table
fit.pvalues.reset_index().rename({0:'p_values'},axis=1)
index p_values
0 Intercept 0.000269
1 C(Category)[T.B] 0.028933
2 C(Category)[T.C] 0.372288
3 C(Category)[T.D] 0.097482
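A note on reading these rows: with `C(Category)` and the default treatment coding, the first level (A) acts as the baseline. The intercept is then the mean of Value1 in A, and each `[T.x]` coefficient is the difference between that category's mean and A's mean. The p-values therefore test each category against A, not against all other categories combined. The sketch below reproduces those quantities with pandas alone; the data is retyped from the question, and `baseline` and `diffs` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question (Value1 only)
df = pd.DataFrame({
    "Category": list("AABCCCDD"),
    "Value1": [5.8, 5.7, 8.5, 5.3, 4.2, 5.9, 7.6, 6.8],
})

group_means = df.groupby("Category")["Value1"].mean()

# Baseline level under treatment coding: the first category, A
baseline = group_means["A"]

# Differences from the baseline; these correspond to the
# C(Category)[T.B], [T.C] and [T.D] coefficients of the fit
diffs = group_means.drop("A") - baseline

print("Intercept (mean of A):", baseline)
print(diffs)
```

So the row `C(Category)[T.B]` with p of about 0.029 says that B's mean Value1 differs from A's, while C and D are not clearly distinguishable from A at the usual 0.05 level. With this few observations per category, treat the result as a rough indication rather than firm evidence.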
Upvotes: 2