Reputation: 903
import numpy as np
import pandas as pd
from scipy.stats import levene as lev
raw_data = {'IDs': ['G1', 'G2', 'G3', 'G4', 'G5'],
'Sample1': [102.2, 2310.4, 123.4, 213.0, 1234.2],
'Sample2': [112.8, 1910.4, 36.3, 188.2, 1271.2],
'Sample3': [32.2, 1290.3, 121.4, 212.3, 1333.5],
'Sample4': [52.1, 2210.1, 155.2, 244.7, 1987.1]}
raw_data2 = {'IDs': ['S1', 'S2', 'S3', 'S4', 'S5'],
'Sample1': [1, 2, 1, 0, 2],
'Sample2': [2, 1, 2, 1, 2],
'Sample3': [2, 0, 1, 0, 1],
'Sample4': [1, 2, 1, 2, 1]}
df1 = pd.DataFrame(raw_data, columns = ['IDs', 'Sample1', 'Sample2', 'Sample3', 'Sample4'])
df2 = pd.DataFrame(raw_data2, columns = ['IDs', 'Sample1', 'Sample2', 'Sample3', 'Sample4'])
I have been trying to figure out a way to implement a levenes test on each row of df1, with every row df2 defining the groups to split. For example row one of df1 would be grouped by every row of df2, every time a levenes test would be run. Obviously I can implement it with a nested loops such as (have to include the if statements also as not all rows will contain all groups):
for i in range(0, df1.shape[0]):
for j in range(0, df2.shape[0]):
tmp1=df1.ix[i,:]
tmp2=df2.ix[i,:]
group1 = tmp1[tmp2==0]
group2 = tmp1[tmp2==1]
group3 = tmp1[tmp2==2]
if len(group1) <= 1:
lev(group2,group3) # and some how return the output to a new df
elif len(group2) <= 1:
lev(group1,group3) # and some how return the output to a new df
elif len(group3) <=1:
lev(group1,group2) # and some how return the output to a new df
else:
lev(group1,group2,group3) # and some how return the output to a new df
The samples are in the same order in the dataframes, however one df has a few extra columns of descriptor (important to retain for output).
As I will be doing millions off comparisons it will be impractical to do it with loops, my first attempt was going to take 120 yrs... I have refined it, but need to drop the loops to really improve it.
I have been reading about trying to use vectorisation, I am somewhat familiar with it in R using the apply functions. I imagine there would be an elegant way in pandas, numpy etc but have not cracked it yet.
To make it more clear the expected output would be something like (sorry haven't calculated the lev test so don't have actual numbers - will update when get to computer):
DF1-ID DF2-ID Lev.stat Lev.pvalue
G1 S1 float float
G1 S2 float float
G1 S3 float float
G1 S4 float float
G2 S1 float float
.
.
.
G4 S4 float float
Upvotes: 1
Views: 226
Reputation: 21274
This solution is not particularly elegant, and it's a little brittle, meaning it will fail on some edge cases. But it will work if your data don't have any unusual cases.
First, this solution assumes that the IDs
columns in df1
and df2
are isomorphic (G1
=S1
, etc). Second, the solution assumes that there are only ever 2 or 3 groups of data to run Levene's test on. (This is true for your sample data.) Third, your sample data don't have a lot of data and sometimes this throws off the Levene test - in these instances, Scipy will complain (although it just returns Inf
and completes execution).
In short, the procedure is:
melt()
df1
and df2
groupby
to make lists of the grouped values lev()
on the groups First:
# merge data
df2.IDs = df2.IDs.str.replace("S", "G")
merged = df1.melt(id_vars="IDs").merge(df2.melt(id_vars="IDs"), on=["IDs", "variable"])
# group values into lists
lev_data = merged.groupby(["IDs", "value_y"]).value_x.agg(list).reset_index()
lev_data
IDs value_y value_x
0 G1 1 [102.2, 52.1]
1 G1 2 [112.8, 32.2]
2 G2 0 [1290.3]
3 G2 1 [1910.4]
4 G2 2 [2310.4, 2210.1]
5 G3 1 [123.4, 121.4, 155.2]
6 G3 2 [36.3]
7 G4 0 [213.0, 212.3]
8 G4 1 [188.2]
9 G4 2 [244.7]
10 G5 1 [1333.5, 1987.1]
11 G5 2 [1234.2, 1271.2]
Next:
# run lev() on two- or three-group sets
def do_lev(vals):
if len(vals) == 2:
return lev(vals.iloc[0,2], vals.iloc[1,2])
return lev(vals.iloc[0,2], vals.iloc[1,2], vals.iloc[2,2])
lev_data.groupby("IDs").apply(do_lev)
Output:
IDs
G1 (1.84254995447e+31, 5.42726126677e-32) # (test statistic, p-value)
G2 (inf, 0.0)
G3 (0.300123996469, 0.638777548242)
G4 (inf, 0.0)
G5 (inf, 0.0)
dtype: object
Note: Full vectorization is prevented somewhat, as scipy.stats.levene()
requires each sample to be entered separately, instead of accepting a list of vectors. That means that the samples need to be broken out and entered into lev()
one-by-one, which complicates vectorization.
Upvotes: 1