George
George

Reputation: 903

Pandas vectorisation between two dataframes

import numpy as np
import pandas as pd
from scipy.stats import levene as lev 

raw_data = {'IDs': ['G1', 'G2', 'G3', 'G4', 'G5'], 
            'Sample1': [102.2, 2310.4, 123.4, 213.0, 1234.2], 
            'Sample2': [112.8, 1910.4, 36.3, 188.2, 1271.2], 
            'Sample3': [32.2, 1290.3, 121.4, 212.3, 1333.5], 
            'Sample4': [52.1, 2210.1, 155.2, 244.7, 1987.1]} 

raw_data2 = {'IDs': ['S1', 'S2', 'S3', 'S4', 'S5'], 
        'Sample1': [1, 2, 1, 0, 2], 
        'Sample2': [2, 1, 2, 1, 2], 
        'Sample3': [2, 0, 1, 0, 1], 
        'Sample4': [1, 2, 1, 2, 1]} 
df1 = pd.DataFrame(raw_data, columns = ['IDs', 'Sample1', 'Sample2', 'Sample3', 'Sample4'])
df2 = pd.DataFrame(raw_data2, columns = ['IDs', 'Sample1', 'Sample2', 'Sample3', 'Sample4'])

I have been trying to figure out a way to implement a levenes test on each row of df1, with every row df2 defining the groups to split. For example row one of df1 would be grouped by every row of df2, every time a levenes test would be run. Obviously I can implement it with a nested loops such as (have to include the if statements also as not all rows will contain all groups):

   for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        tmp1=df1.ix[i,:]
        tmp2=df2.ix[i,:]
        group1 = tmp1[tmp2==0]
        group2 = tmp1[tmp2==1]
        group3 = tmp1[tmp2==2]
        if len(group1) <= 1:
            lev(group2,group3) # and some how return the output to a new df
        elif len(group2) <= 1:
            lev(group1,group3) # and some how return the output to a new df
        elif len(group3) <=1:
            lev(group1,group2) # and some how return the output to a new df
        else:
            lev(group1,group2,group3) # and some how return the output to a new df

The samples are in the same order in the dataframes, however one df has a few extra columns of descriptor (important to retain for output).

As I will be doing millions off comparisons it will be impractical to do it with loops, my first attempt was going to take 120 yrs... I have refined it, but need to drop the loops to really improve it.

I have been reading about trying to use vectorisation, I am somewhat familiar with it in R using the apply functions. I imagine there would be an elegant way in pandas, numpy etc but have not cracked it yet.

To make it more clear the expected output would be something like (sorry haven't calculated the lev test so don't have actual numbers - will update when get to computer):

DF1-ID DF2-ID Lev.stat Lev.pvalue
G1 S1 float float
G1 S2 float float
G1 S3 float float
G1 S4 float float
G2 S1 float float
.
.
.
G4 S4 float float

Upvotes: 1

Views: 226

Answers (1)

andrew_reece
andrew_reece

Reputation: 21274

This solution is not particularly elegant, and it's a little brittle, meaning it will fail on some edge cases. But it will work if your data don't have any unusual cases.

First, this solution assumes that the IDs columns in df1 and df2 are isomorphic (G1=S1, etc). Second, the solution assumes that there are only ever 2 or 3 groups of data to run Levene's test on. (This is true for your sample data.) Third, your sample data don't have a lot of data and sometimes this throws off the Levene test - in these instances, Scipy will complain (although it just returns Inf and completes execution).

In short, the procedure is:

  1. swing from wide to long format with melt()
  2. merge df1 and df2
  3. use groupby to make lists of the grouped values
  4. run lev() on the groups

First:

# merge data
df2.IDs = df2.IDs.str.replace("S", "G")
merged = df1.melt(id_vars="IDs").merge(df2.melt(id_vars="IDs"), on=["IDs", "variable"])

# group values into lists
lev_data = merged.groupby(["IDs", "value_y"]).value_x.agg(list).reset_index()

lev_data
   IDs  value_y                value_x
0   G1        1          [102.2, 52.1]
1   G1        2          [112.8, 32.2]
2   G2        0               [1290.3]
3   G2        1               [1910.4]
4   G2        2       [2310.4, 2210.1]
5   G3        1  [123.4, 121.4, 155.2]
6   G3        2                 [36.3]
7   G4        0         [213.0, 212.3]
8   G4        1                [188.2]
9   G4        2                [244.7]
10  G5        1       [1333.5, 1987.1]
11  G5        2       [1234.2, 1271.2]

Next:

# run lev() on two- or three-group sets
def do_lev(vals):
    if len(vals) == 2:
        return lev(vals.iloc[0,2], vals.iloc[1,2])
    return lev(vals.iloc[0,2], vals.iloc[1,2], vals.iloc[2,2])

lev_data.groupby("IDs").apply(do_lev)

Output:

IDs
G1    (1.84254995447e+31, 5.42726126677e-32) # (test statistic, p-value)
G2                                (inf, 0.0)
G3          (0.300123996469, 0.638777548242)
G4                                (inf, 0.0)
G5                                (inf, 0.0)
dtype: object

Note: Full vectorization is prevented somewhat, as scipy.stats.levene() requires each sample to be entered separately, instead of accepting a list of vectors. That means that the samples need to be broken out and entered into lev() one-by-one, which complicates vectorization.

Upvotes: 1

Related Questions