Reputation: 4933
I'm using a Pandas DataFrame to do a row-wise t-test as per this example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.log2(np.random.randn(1000, 4)), columns=["a", "b", "c", "d"]).dropna()
Now, suppose "a" and "b" form one group and "c" and "d" the other, and I want to perform the t-test row-wise. This is fairly trivial with pandas, using apply with axis=1. However, I can either return a DataFrame of the same shape if my function doesn't aggregate, or a Series if it aggregates.
Normally I would just output the p-value (so, aggregation) but I would like to generate an additional value based on other calculations (in other words, return two values). I can of course do two runs, aggregating the p-values first, then doing the other work, but I was wondering if there is a more efficient way to do so as the data is reasonably large.
As an example of the calculation, a hypothetical function would be:
from scipy.stats import ttest_ind
def t_test_and_mean(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = second_group.mean() / first_group.mean()
    return (pvalue, mean_ratio)
Then invoked with
df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)
Of course, in this case it returns a single Series whose values are two-element tuples.
Instead, my expected output would be a DataFrame with two columns, one for the first result and one for the second. Is this possible, or do I have to do two runs for the two calculations and then merge them together?
Upvotes: 59
Views: 53206
Reputation: 23011
apply(axis=1)
apply has a result_type= parameter that can expand a result into a dataframe. For OP's case, that would look like the following (note that the original function doesn't need to be touched):
df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], result_type='expand', axis=1)
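To see what result_type='expand' does on a frame small enough to check by hand, here is a minimal sketch. The function diff_and_ratio and its output column names are hypothetical stand-ins for t_test_and_mean, chosen so the example runs without scipy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for t_test_and_mean: returns two values per row.
def diff_and_ratio(row, first, second):
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

df = pd.DataFrame(
    [[1.0, 3.0, 4.0, 8.0], [2.0, 2.0, 1.0, 3.0]],
    columns=["a", "b", "c", "d"],
)

# With result_type="expand", each returned tuple becomes one row of a DataFrame.
expanded = df.apply(
    diff_and_ratio, first=["a", "b"], second=["c", "d"],
    axis=1, result_type="expand",
)
# The expanded columns are labelled 0 and 1 by default, so name them explicitly.
expanded.columns = ["mean_diff", "mean_ratio"]
print(expanded)
```

Assigning the result to df[['mean_diff', 'mean_ratio']] instead, as in the line above, attaches the two columns to the original frame.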
Casting each row into a pandas Series is painfully slow (for a frame with 10k rows, it takes 20 seconds). A faster solution is to convert the values returned from an apply call into a list and cast into a DataFrame once (or assign back to the dataframe). Or use a Python loop for an even faster solution (how that can be written is shown at the end of this post).
For the case in the OP, that would look like the following (again, the original function shouldn't be altered).
df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist()
# or create a new frame
new_df = pd.DataFrame(df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist(), index=df.index, columns=['pvalue', 'mean_ratio'])
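The intermediate object in the list-conversion approach is worth looking at once. The sketch below, again with a hypothetical diff_and_ratio function standing in for t_test_and_mean, shows that apply yields a Series of tuples and that passing index=df.index keeps non-default row labels aligned.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for t_test_and_mean: returns two values per row.
def diff_and_ratio(row, first, second):
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

df = pd.DataFrame(
    [[1.0, 3.0, 4.0, 8.0], [2.0, 2.0, 1.0, 3.0]],
    columns=["a", "b", "c", "d"],
    index=["x", "y"],  # non-default labels, to show they are preserved
)

# Without result_type, tuple results come back as a Series of tuples.
pairs = df.apply(diff_and_ratio, first=["a", "b"], second=["c", "d"], axis=1)

# Build the two-column frame in a single constructor call.
new_df = pd.DataFrame(
    pairs.tolist(), index=df.index, columns=["mean_diff", "mean_ratio"]
)
print(new_df)
```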
groupby.apply
The same can be done for functions called via groupby.apply as well. Simply convert the result into a list and cast into a new dataframe. For example, if we call the function in the OP in a groupby call, the result could be fixed up as follows:
# sample data
df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])
df['grouper'] = list(range(10))*100
# perform groupby
x = df.groupby('grouper').apply(t_test_and_mean, first="a", second="c")
# convert the groupby result into a list and cast into a dataframe
# in order to not lose any information, assign index and axis name appropriately
agg_df = pd.DataFrame(x.tolist(), index=x.index, columns=['pvalue', 'mean_ratio']).rename_axis(x.index.name)
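A hand-checkable version of the groupby case might look like the sketch below. The function diff_and_ratio is a hypothetical stand-in for t_test_and_mean. Selecting the value columns before apply is my own choice here, made so that the grouping column is not passed to the function. It is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for t_test_and_mean: receives each group's sub-frame.
def diff_and_ratio(frame, first, second):
    first_mean = frame[first].mean()
    second_mean = frame[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

df = pd.DataFrame({
    "grouper": ["g1", "g1", "g2", "g2"],
    "a": [1.0, 3.0, 2.0, 2.0],
    "c": [4.0, 8.0, 1.0, 3.0],
})

# One tuple per group; the result is a Series indexed by the group keys.
x = df.groupby("grouper")[["a", "c"]].apply(diff_and_ratio, first="a", second="c")

# Reusing x.index keeps both the group labels and the index name.
agg_df = pd.DataFrame(x.tolist(), index=x.index, columns=["mean_diff", "mean_ratio"])
print(agg_df)
```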
apply(axis=1)
Another solution (slower than converting to list) is to chain an .apply(pd.Series) call:
df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).apply(pd.Series)
Since .apply(axis=1) is syntactic sugar for a Python loop, the biggest speed up would be to convert the frame into a list, re-write the function into one that works with Python lists and just use a list comprehension (this would speed up the process about 6 times). For the example in the OP, that would look like:
def t_test_and_mean_on_lists(first_group, second_group):
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = np.mean(second_group) / np.mean(first_group)
    return (pvalue, mean_ratio)
df[['pvalue', 'mean_ratio']] = [t_test_and_mean_on_lists(ab, cd) for ab, cd in zip(df[['a','b']].values.tolist(), df[['c','d']].values.tolist())]
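As a sanity check that the list-based rewrite gives the same numbers as the row-wise apply, here is a small sketch. Both functions are hypothetical stand-ins that compute a mean difference and a mean ratio instead of a t-test, so the comparison does not depend on scipy.

```python
import numpy as np
import pandas as pd

# Hypothetical row-wise version, called through df.apply(axis=1).
def diff_and_ratio(row, first, second):
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

# Hypothetical list-based version, called from a list comprehension.
def diff_and_ratio_on_lists(first_group, second_group):
    first_mean = np.mean(first_group)
    second_mean = np.mean(second_group)
    return (second_mean - first_mean, second_mean / first_mean)

df = pd.DataFrame(
    [[1.0, 3.0, 4.0, 8.0], [2.0, 2.0, 1.0, 3.0]],
    columns=["a", "b", "c", "d"],
)

via_apply = df.apply(
    diff_and_ratio, first=["a", "b"], second=["c", "d"], axis=1
).tolist()

first_rows = df[["a", "b"]].to_numpy().tolist()
second_rows = df[["c", "d"]].to_numpy().tolist()
via_loop = [diff_and_ratio_on_lists(ab, cd) for ab, cd in zip(first_rows, second_rows)]

out = pd.DataFrame(via_loop, index=df.index, columns=["mean_diff", "mean_ratio"])
print(np.allclose(via_apply, via_loop))
```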
Decorating this function with numba.njit would make it even faster but that's outside the scope of this question.
Upvotes: 1
Reputation: 49768
Returning a Series, rather than a tuple, should produce a new multi-column DataFrame. For example,
return pandas.Series({'pvalue': pvalue, 'mean_ratio': mean_ratio})
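Put together with the function from the question, this might look like the sketch below. The function name t_test_and_mean_series and the four-column sample frame are illustrative assumptions. Only the return statement differs from the original function.

```python
import pandas as pd
from scipy.stats import ttest_ind

def t_test_and_mean_series(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = second_group.mean() / first_group.mean()
    # Returning a Series makes apply build one output column per label.
    return pd.Series({"pvalue": pvalue, "mean_ratio": mean_ratio})

df = pd.DataFrame(
    [[1.0, 3.0, 4.0, 8.0], [2.0, 2.5, 1.0, 3.0]],
    columns=["a", "b", "c", "d"],
)

result = df.apply(t_test_and_mean_series, first=["a", "b"], second=["c", "d"], axis=1)
print(result.columns.tolist())
```

The result has the same index as df, so it can be joined back with df.join(result) if the extra columns are wanted alongside the data.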
Upvotes: 105