Reputation: 1695
I have surveyed people about which fruit they like to eat (see data below) and I want to see whether there are clusters in the data. For example, do people who like bananas frequently also like loganberries? There are 23 different types of fruit and 400 respondents.
I would like to conduct the analysis in Python with pandas, because that's what I know best. Is this a sane option, and if so, is there a common or recommended approach to this type of problem? There seems to be a lot of conflicting advice.
Participant | Bananas | Apples | Kumquats | Loganberries
------------|---------|--------|----------|-------------
1 | Yes | No | Yes | Yes
2 | Yes | Yes | No | Yes
3 | Yes | No | Yes | No
4 | No | No | No | Yes
5 | Yes | No | Yes | Yes
6 | Yes | Yes | No | No
Upvotes: 0
Views: 237
Reputation: 120391
Use corr to get the correlation matrix:
# Encode Yes/No as 1/0, then compute the pairwise correlations between fruits
out = df.set_index('Participant').replace({'Yes': 1, 'No': 0}).corr()
print(out)
# Output
               Bananas    Apples  Kumquats  Loganberries
Bananas       1.000000  0.316228  0.447214     -0.316228
Apples        0.316228  1.000000 -0.707107     -0.250000
Kumquats      0.447214 -0.707107  1.000000      0.000000
Loganberries -0.316228 -0.250000  0.000000      1.000000
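With 23 fruits the full matrix is hard to read at a glance, so it may help to list each pair once and rank the pairs by correlation. The following is a minimal, self-contained sketch of that idea, assuming the data looks like the six-row sample in the question. The names `binary` and `pairs` are only illustrative, and the Yes/No encoding uses `eq('Yes')` instead of `replace` purely as a matter of preference. For 0/1 columns, the Pearson correlation that `corr` computes by default is the phi coefficient.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question (6 of the 400 respondents).
df = pd.DataFrame({
    "Participant": [1, 2, 3, 4, 5, 6],
    "Bananas": ["Yes", "Yes", "Yes", "No", "Yes", "Yes"],
    "Apples": ["No", "Yes", "No", "No", "No", "Yes"],
    "Kumquats": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Loganberries": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
})

# Encode Yes/No as 1/0 so the columns are numeric.
binary = df.set_index("Participant").eq("Yes").astype(int)

# Pairwise Pearson correlation between fruit columns.
corr = binary.corr()

# Keep every unordered pair of fruits once, then sort from the most
# positively to the most negatively correlated pair.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(ascending=False)

print(pairs)
```

On the sample, Bananas/Kumquats comes out on top (about 0.447) and Apples/Kumquats at the bottom (about -0.707). With only six respondents these numbers mean very little; the ranking becomes more useful on the full 400-row data.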
Upvotes: 1