Reputation: 1222
I wrote a small class to compute some statistics through bootstrap without replacement. For those not familiar with this technique, you get n
random subsamples of some data, compute the desired statistic (lets say the median) on each subsample, and then compare the values across subsamples. This allows you to get a measure of variance on the obtained median over the dataset.
I implemented this in a class but reduced it to a MWE given by the following function
import numpy as np
import pandas as pd
def bootstrap_median(df, n=5000, fraction=0.1):
if isinstance(df, pd.DataFrame):
columns = df.columns
else:
columns = None
# Get the values as a ndarray
arr = np.array(df.values)
# Get the bootstrap sample through random permutations
sample_len = int(len(arr)*fraction)
if sample_len<1:
sample_len = 1
sample = []
for n_sample in range(n):
sample.append(arr[np.random.permutation(len(arr))[:sample_len]])
sample = np.array(sample)
# Compute the median on each sample
temp = np.median(sample, axis=1)
# Get the mean and std of the estimate across samples
m = np.mean(temp, axis=0)
s = np.std(temp, axis=0)/np.sqrt(len(sample))
# Convert output to DataFrames if necesary and return
if columns:
m = pd.DataFrame(data=m[None, ...], columns=columns)
s = pd.DataFrame(data=s[None, ...], columns=columns)
return m, s
This function returns the mean and standard deviation across the medians computed on each bootstrap sample.
Now consider this example DataFrame
data = np.arange(20)
group = np.tile(np.array([1, 2]).reshape(-1,1), (1,10)).flatten()
df = pd.DataFrame.from_dict({'data': data, 'group': group})
print(df)
print(bootstrap_median(df['data']))
this prints
data group
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
(9.5161999999999995, 0.056585753613431718)
So far so good because bootstrap_median
returns a tuple
of two elements. However, if I do this after a groupby
In: df.groupby('group')['data'].apply(bootstrap_median)
Out:
group
1 (4.5356, 0.0409710449952)
2 (14.5006, 0.0403772204095)
The values inside each cell are tuple
s, as one would expect from apply
. I can unpack the result into two DataFrame
's by iterating over elements like this:
index = []
data1 = []
data2 = []
for g, (m, s) in out.iteritems():
index.append(g)
data1.append(m)
data2.append(s)
dfm = pd.DataFrame(data=data1, index=index, columns=['E[median]'])
dfm.index.name = 'group'
dfs = pd.DataFrame(data=data2, index=index, columns=['std[median]'])
dfs.index.name = 'group'
thus
In: dfm
Out:
E[median]
group
1 4.5356
2 14.5006
In: dfs
Out:
std[median]
group
1 0.0409710449952
2 0.0403772204095
This is a bit cumbersome and my question is if there is a more pandas
native way to "unpack" a dataframe whose values are tuples into separate DataFrame
's
This question seemed related but it concerned string regex replacements and not unpacking true tuples.
Upvotes: 1
Views: 980
Reputation: 862581
I think you need change:
return m, s
to:
return pd.Series([m, s], index=['m','s'])
And then get:
df1 = df.groupby('group')['data'].apply(bootstrap_median)
print (df1)
group
1 m 4.480400
s 0.040542
2 m 14.565200
s 0.040373
Name: data, dtype: float64
So is possible select by xs
:
print (df1.xs('s', level=1))
group
1 0.040542
2 0.040373
Name: data, dtype: float64
print (df1.xs('m', level=1))
group
1 4.4804
2 14.5652
Name: data, dtype: float64
Also if need one column DataFrame
add to_frame
:
df1 = df.groupby('group')['data'].apply(bootstrap_median).to_frame()
print (df1)
data
group
1 m 4.476800
s 0.041100
2 m 14.468400
s 0.040719
print (df1.xs('s', level=1))
data
group
1 0.041100
2 0.040719
print (df1.xs('m', level=1))
data
group
1 4.4768
2 14.4684
Upvotes: 2