Graham Jones
Graham Jones

Reputation: 241

Frequency/contingency table across two groups of dummy variables

I have a Dataframe of survey results; one column per question, one row per respondent. Q1 and Q2 are multiple choice questions and are coded in dummy variable style, one column per option. E.g.:

d=pd.DataFrame({
    'Q1a':[0,0,0,1,1,0,1,1],
    'Q1b':[1,1,1,1,1,1,1,1],
    'Q1c':[1,0,0,0,0,1,1,0],
    'Q1d':[0,1,1,0,1,1,0,0],
    'Q2a':[1,1,1,0,0,0,0,0],
    'Q2b':[0,1,1,0,0,1,1,0],
    'Q2c':[1,0,1,1,1,1,0,1]
})

print d
   Q1a  Q1b  Q1c  Q1d  Q2a  Q2b  Q2c
0    0    1    1    0    1    0    1
1    0    1    0    1    1    1    0
2    0    1    0    1    1    1    1
3    1    1    0    0    0    0    1
4    1    1    0    1    0    0    1
5    0    1    1    1    0    1    1
6    1    1    1    0    0    1    0
7    1    1    0    0    0    0    1

So this example has 8 respondents, Q1 has 4 options and Q2 has 3.

I need a frequency table showing the number (or %) of respondents that ticked 'yes' for each pair combination in Q1 and Q2:

    Q2a Q2b Q2c
Q1a 0   1   3
Q1b 3   4   6
Q1c 1   2   2
Q1d 2   3   3

So 6 respondents said yes to both Q1b and Q2c.

Conceptually, a command like

import pandas as pd
pd.value_counts(
    rows    = (d['Q1a'],d['Q1b'],d['Q1c'],d['Q1d']),
    columns = (d['Q2a'],d['Q2b'],d['Q2c'])
)

Am I missing something very obvious? It feels like I am but I can't quite see it. I've searched and haven't found this addressed specifically anywhere. Python/pandas would be ideal but R is ok too.

Upvotes: 3

Views: 547

Answers (1)

user2285236
user2285236

Reputation:

You can use dot product:

d.loc[:, d.columns.str.startswith('Q1')].T.dot(d.loc[:, d.columns.str.startswith('Q2')])
Out: 
     Q2a  Q2b  Q2c
Q1a    0    1    3
Q1b    3    4    6
Q1c    1    2    2
Q1d    2    3    3

The parts d.loc[:, d.columns.str.startswith('Q1')] and d.loc[:, d.columns.str.startswith('Q2')] basically selecting columns. It can be done more clearly with d.filter(like='Q1') as @Zero suggests:

d.filter(like='Q1').T.dot(d.filter(like='Q2'))

If you want to select columns arbitrarily, you can just pass a list of column names instead. The equivalent of the above is

d.loc[:, ['Q1a', 'Q1b', 'Q1c', 'Q1d']].T.dot(d.loc[:, ['Q2a', 'Q2b', 'Q2c']])

Upvotes: 4

Related Questions