Rory Devitt
Rory Devitt

Reputation: 161

Get combinations of multiple columns (Cartesian product) of a pandas dataframe?

So I have a dataframe representing various model estimates for the likelihood of each of a group of candidates winning an election.

             Steve     John      
    Model1   0.327586  0.289474 
    Model2   0.322581  0.285714 
    Model3   0.303030  0.294118

I want a dataframe representing all of the combinations of model values across columns, i.e. Cartesian product of all columns. So the above would be transformed to the below.

             model Steve     value Steve    model John     value John     
    0        Model1          0.327586       Model1         0.289474
    1        Model1          0.327586       Model2         0.285714
    2        Model1          0.327586       Model3         0.294118
    3        Model2          0.322581       Model1         0.289474
    4        Model2          0.322581       Model2         0.285714
    5        Model2          0.322581       Model3         0.294118
    6        Model3          0.303030       Model1         0.289474
    7        Model3          0.303030       Model2         0.285714
    8        Model3          0.303030       Model3         0.294118

The above is the simple case, but in theory I would like to be able to do this for N models and M candidates, resulting in a dataframe with N^M rows and 2M columns (in practice N < 20, M < 6).

In searching for an answer I've seen a lot of recommendations for the itertools module for stuff like this, but couldn't figure out how to get all of the combinations across multiple lists (itertools.combinations seems to only work for finding all of the combinations within a single list).

Upvotes: 2

Views: 2891

Answers (2)

jezrael
jezrael

Reputation: 863801

Use:

from  itertools import product

#get all combinations of all columns
a = product(*[zip(df.index, x) for x in df.T.values])
#create new columns names
cols = [c for x in df.columns for c in ('model_' + x, 'value_' + x)]
#flattening nested lists with DataFrame contructor
df1 = pd.DataFrame([[y for x in z for y in x] for z in a], columns=cols)
print (df1)
  model_Steve  value_Steve model_John  value_John
0      Model1     0.327586     Model1    0.289474
1      Model1     0.327586     Model2    0.285714
2      Model1     0.327586     Model3    0.294118
3      Model2     0.322581     Model1    0.289474
4      Model2     0.322581     Model2    0.285714
5      Model2     0.322581     Model3    0.294118
6      Model3     0.303030     Model1    0.289474
7      Model3     0.303030     Model2    0.285714
8      Model3     0.303030     Model3    0.294118

Upvotes: 4

DataPsycho
DataPsycho

Reputation: 988

It is always better to provide the code so that we can create a frame quickly, not just a table. Any way you can create a common key and can do database like cross join to get the final result. You can do it in one line but I am doing step by step.

import pandas as pd


df = pd.DataFrame({'model': ['model1', 'model2'],
                   'steve': ['a', 'b'],
                   'jhon': ['c', 'd']
                  })

# create a common key

df['key'] = 'xyz'

# create two seperate dataframe for self join
# but it is possible to use the direct operation (right side) in 
# inside of merge funciton

df_steve = df [['model', 'steve', 'key']]
df_jhon = df [['model', 'jhon', 'key']]

# self join    
pd.merge(df_steve, df_jhon, on='key', suffixes=('_steve', '_jhon')).drop('key', axis=1)

output:

  model_steve steve model_jhon jhon
0      model1     a     model1    c
1      model1     a     model2    d
2      model2     b     model1    c
3      model2     b     model2    d

One liner Code:

cross_df = pd.merge(df[['model', 'steve', 'key']], 
                    df[['model', 'jhon', 'key']], 
                    on='key', 
                    suffixes=('_steve', '_jhon')
                    ).drop('key', axis=1)

just change the column name according to your need.

Upvotes: 2

Related Questions