Reputation: 161
I have a DataFrame stats with a Multindex and 8 samples (only two shown here) and 8 genes for each sample.
In[13]:stats
Out[13]:
ARG/16S \
count mean std min
sample gene
Arnhem IC 11.0 2.319050e-03 7.396130e-04 1.503150e-03
Int1 11.0 7.243040e+00 6.848327e+00 1.364879e+00
Sul1 11.0 3.968956e-03 9.186019e-04 2.499074e-03
TetB 2.0 1.154748e-01 1.627663e-01 3.816936e-04
TetM 4.0 1.083125e-04 5.185259e-05 5.189226e-05
blaOXA 4.0 4.210963e-06 3.783235e-07 3.843571e-06
ermB 4.0 4.111081e-05 7.894879e-06 3.288865e-05
ermF 4.0 2.335210e-05 4.519758e-06 1.832037e-05
Basel Aph3a 4.0 7.815592e-06 1.757242e-06 5.539389e-06
IC 11.0 5.095161e-03 5.639278e-03 1.302205e-03
Int1 12.0 1.333068e+01 1.872207e+01 4.988048e-02
Sul1 11.0 1.618617e-02 1.988817e-02 2.970397e-03
I'm trying to calculate the p-value (Students t-test) for each of these samples, comparing each of the genes between them.
I've used scipy.stats.ttest_ind_from_stats but I managed to get the p-values for the different samples for one gene and only those of the samples neighboring each other.
Experiments = list(values1_16S['sample'].unique())
for exp in Experiments:
if Experiments.index(exp)<len(Experiments)-1:
second = Experiments[Experiments.index(exp)+1]
else:
second = Experiments[0]
tstat, pvalue = scipy.stats.ttest_ind_from_stats(stats.loc[(exp,'Sul1')]['ARG/16S','mean'],
stats.loc[(exp,'Sul1')]['ARG/16S','std'],
stats.loc[(exp,'Sul1')]['ARG/16S','count'],
stats.loc[(second,'Sul1')]['ARG/16S','mean'],
stats.loc[(second,'Sul1')]['ARG/16S','std'],
stats.loc[(second,'Sul1')]['ARG/16S','count'])
d.append({'loc1':exp, 'loc2':second, 'pvalue':pvalue})
stats_Sul1 = pd.DataFrame(d)
stats_Sul1
How can I get the pvalues between ALL samples? And is there a way to do this for all genes at once without running the code one by one for each gene?
Upvotes: 6
Views: 28530
Reputation: 29635
Let's suppose you have the same X genes for the Y samples. I try my method with X=3 and Y=2 but I guess you can generalize. I started with:
df1 =
count mean std min
sample gene
Arnhem IC 11 0.002319 0.000740 0.001503
Int1 11 7.243040 6.848327 1.364879
Sul1 11 0.003969 0.000919 0.002499
Basel IC 11 0.005095 0.005639 0.001302
Int1 12 13.330680 18.722070 0.049880
Sul1 11 0.016186 0.019888 0.002970
Note that the genes need to be in the same order.
First reset_index()
with df_reindex = df1.reset_index()
, I'm not sure what I'm doing is possible with multiindex:
df_reindex =
sample gene count mean std min
0 Arnhem IC 11 0.002319 0.000740 0.001503
1 Arnhem Int1 11 7.243040 6.848327 1.364879
2 Arnhem Sul1 11 0.003969 0.000919 0.002499
3 Basel IC 11 0.005095 0.005639 0.001302
4 Basel Int1 12 13.330680 18.722070 0.049880
5 Basel Sul1 11 0.016186 0.019888 0.002970
I create a rolled DF and join it to df_reindex
:
nb_genes = 3
df_rolled = pd.DataFrame(pd.np.roll(df_reindex,nb_genes,0), columns = df_reindex.columns)
df_joined = df_reindex.join(df_rolled, rsuffix='_')
# rsuffix='_' is to be able to perform the join
Now on a same row, I have all data you needto calculate pvalue
and create the column with apply
:
df_joined['pvalue'] = df_joined.apply(lambda x: stats.ttest_ind_from_stats(x['mean'],x['std'],x['count'], x['mean_'],x['std_'],x['count_'])[1],axis=1)
Finally, I create a DF with the data you want and rename columns:
df_output = df_joined[['sample','sample_','gene','pvalue']].rename(columns = {'sample':'loc1', 'sample_':'loc2'})
You ends up with data:
df_output =
loc1 loc2 gene pvalue
0 Arnhem Basel IC 0.121142
1 Arnhem Basel Int1 0.321072
2 Arnhem Basel Sul1 0.055298
3 Basel Arnhem IC 0.121142
4 Basel Arnhem Int1 0.321072
5 Basel Arnhem Sul1 0.055298
That you can reindex as you need.
If you want to do it each sample against each other, I think a loop for
could do it.
EDIT: Using pivot_table
, I think there is a easier way.
With your input stats
as multiindex table for only ARG/16S
(not sure how to handle this level), so I start with (which might be your stats['ARG/16S']
):
df=
count mean std min
sample gene
Arnhem IC 11 0.002319 7.396130e-04 0.001503
Int1 11 7.243040 6.848327e+00 1.364879
Sul1 11 0.003969 9.186019e-04 0.002499
TetB 2 0.115475 1.627663e-01 0.000382
TetM 4 0.000108 5.185259e-05 0.000052
blaOXA 4 0.000004 3.783235e-07 0.000004
ermB 4 0.000041 7.894879e-06 0.000033
ermF 4 0.000023 4.519758e-06 0.000018
Basel Aph3a 4 0.000008 1.757242e-06 0.000006
IC 11 0.005095 5.639278e-03 0.001302
Int1 12 13.330680 1.872207e+01 0.049880
Sul1 11 0.016186 1.988817e-02 0.002970
With the function pivot_table
, you can rearrange your data such as:
df_pivot = df.pivot_table(values = ['count','mean','std'], index = 'gene',
columns = 'sample', fill_value = 0)
In this df_pivot
(I don't print it here for readability but at the end with the new column), you can create a column for each couple (sample1, sample2) using itertools
and apply
:
import itertools
for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
# itertools.combinations create all combinations between your samples
df_pivot[sample1+ '_' + sample2 ] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1],
x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)
I think this method is independent of the number of samples, genes and if genes are not all the same, you ends up with df_pivot
like:
count mean std Arnhem_Basel
sample Arnhem Basel Arnhem Basel Arnhem Basel
gene
Aph3a 0 4 0.000000 0.000008 0.000000e+00 0.000002 1.000000
IC 11 11 0.002319 0.005095 7.396130e-04 0.005639 0.121142
Int1 11 12 7.243040 13.330680 6.848327e+00 18.722070 0.321072
Sul1 11 11 0.003969 0.016186 9.186019e-04 0.019888 0.055298
TetB 2 0 0.115475 0.000000 1.627663e-01 0.000000 1.000000
TetM 4 0 0.000108 0.000000 5.185259e-05 0.000000 1.000000
blaOXA 4 0 0.000004 0.000000 3.783235e-07 0.000000 1.000000
ermB 4 0 0.000041 0.000000 7.894879e-06 0.000000 1.000000
ermF 4 0 0.000023 0.000000 4.519758e-06 0.000000 1.000000
Let me know if it works
EDIT2: to reply to the comment, I think you can do this:
No change for df_pivot
and then you create a multiindex DF df_multi
to write your results in:
df_multi = pd.DataFrame(index = df.index.levels[1],
columns = pd.MultiIndex.from_tuples([p for p in itertools.combinations(df.index.levels[0],2)])).fillna(0)
Then you use the loop for
to implement the data in this df_multi
:
for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
# itertools.combinations create all combinations between your samples
df_multi.loc[:,(sample1,sample2)] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1],
x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)
Finally, you can use transpose
and unstack
on level 1 to get the way you ask (or close if I misunderstood)
df_output = df_multi.transpose().unstack(level=[1]).fillna(1)
You will see that you don't have the last sample in indexes and first sample in columns (because they don't exist how I built everything) if you want them, you need to replace itertools.combinations
by itertools.combinations_with_replacement
in both the creation of df_multi
and in the loop for
( I didn't try it but it should work)
Upvotes: 10