Will Bryant
Will Bryant

Reputation: 541

pandas frequency table based on multiple crosstabs

I have data about the presence of a set of metabolites in various compartments of a system. I also have information about what type each of the metabolites is. I want a frequency table showing the number of metabolites of each type in each compartment. The data look something like this:

df = pd.DataFrame({'met_id':['met_a','met_b','met_c','met_d','met_e','met_f'],
                   'met_type':['amino_acid','amino_acid','lipid','lipid','peptide','peptide'],
                   'comp_1':[True,False,True,True,False,True],
                   'comp_2':[False,True,True,False,True,True]})
print df

gives

  comp_1 comp_2 met_id    met_type
0   True  False  met_a  amino_acid
1  False   True  met_b  amino_acid
2   True   True  met_c       lipid
3   True  False  met_d       lipid
4  False   True  met_e     peptide
5   True   True  met_f     peptide

I want a summary table (or DataFrame) like this:

met_type      comp_1 comp_2
amino_acid    1      1
lipid         2      1
peptide       1      2

indicating how many of each type of metabolite is in each compartment. I can get the counts using crosstab thus:

pd.crosstab(df_test.met_type,df_test.comp_1)[True]

gives

met_type
amino_acid    1
lipid         2
peptide       1
Name: True, dtype: int64

and I guess I can concatenate each of these series, but is there a neat way to create a table with counts for all of the compartment columns?

Upvotes: 0

Views: 469

Answers (1)

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

You can use groupby mean

df = pd.DataFrame({'met_id':
['met_a','met_b','met_c','met_d','met_e','met_f'],
               'met_type':
['amino_acid','amino_acid','lipid','lipid','peptide','peptide'],
                   'comp_1':[True,False,True,True,False,True],
                  'comp_2':[False,True,True,False,True,True]})
dfn = df.groupby("met_type").mean()
dfn = dfn[['comp_1','comp_2']]*2

This will give you the summary table

            comp_1  comp_2
met_type                  
amino_acid     1.0     1.0
lipid          2.0     1.0
peptide        1.0     2.0

As you suggest we can use sum ( was still a beginner that time :) )

dfn = df.groupby("met_type")['comp_1','comp_2'].sum().astype(int)
            comp_1  comp_2
met_type                  
amino_acid       1       1
lipid            2       1
peptide          1       2

Upvotes: 1

Related Questions