user4896331

Reputation: 1695

Finding clusters in string data

I have surveyed people about which fruit they like to eat (see data below) and I want to see whether there are clusters in the data. For example, do people who like bananas frequently also like loganberries? There are 23 different types of fruit and 400 respondents.

I would like to conduct the analysis in Python with Pandas, because that's what I know best. Is that a sane option? If so, is there a common approach to this type of problem? I have found a lot of conflicting advice, so a recommended approach would be welcome.

Participant | Bananas |  Apples | Kumquats | Loganberries
------------|-------------------------------------------
1           |  Yes   |   No    |   Yes    |    Yes
2           |  Yes   |   Yes   |   No     |    Yes
3           |  Yes   |   No    |   Yes    |    No
4           |  No    |   No    |   No     |    Yes
5           |  Yes   |   No    |   Yes    |    Yes
6           |  Yes   |   Yes   |   No     |    No

Upvotes: 0

Views: 237

Answers (1)

Corralien

Reputation: 120391

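Encode Yes/No as 1/0, then use corr to get the pairwise correlation matrix between fruits. Positive values mean two fruits tend to be liked by the same people, negative values mean they tend not to be: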

out = df.set_index('Participant').replace({'Yes': 1, 'No': 0}).corr()
print(out)

# Output
               Bananas    Apples  Kumquats  Loganberries
Bananas       1.000000  0.316228  0.447214     -0.316228
Apples        0.316228  1.000000 -0.707107     -0.250000
Kumquats      0.447214 -0.707107  1.000000      0.000000
Loganberries -0.316228 -0.250000  0.000000      1.000000
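
With 23 fruits the full matrix is hard to read by eye, so it can help to list the fruit pairs by strength of association. The following is a minimal sketch, assuming the six sample rows from the question; the names binary, pairs and ranked are illustrative and not part of the original answer. With only six respondents the values are not meaningful, they just show the mechanics.

```python
import pandas as pd

# Sample rows copied from the question (6 of the 400 respondents).
df = pd.DataFrame({
    "Participant": [1, 2, 3, 4, 5, 6],
    "Bananas": ["Yes", "Yes", "Yes", "No", "Yes", "Yes"],
    "Apples": ["No", "Yes", "No", "No", "No", "Yes"],
    "Kumquats": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Loganberries": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
})

# Encode Yes as 1 and everything else as 0 (assumes answers are exactly "Yes"/"No").
binary = df.set_index("Participant").eq("Yes").astype(int)
corr = binary.corr()

# Take each unordered fruit pair once (upper triangle of the matrix).
fruits = list(corr.columns)
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(fruits)
    for b in fruits[i + 1:]
]

# Strongest association first, whether positive or negative.
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
for a, b, r in ranked:
    print(f"{a:<12} {b:<12} {r:+.3f}")
```

On the sample data the first pair listed is Apples and Kumquats at about -0.707, matching the matrix above.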

Upvotes: 1
