boleneuro

Reputation: 23

How to generate all possible combinations of columns in a pandas dataframe with many columns?

I have the following DataFrame:

   A  B  C
0  1  3  3
1  1  9  4
2  4  6  3

I would like to create every possible unique combination of these columns without repetition so that I would end up with a dataframe containing the following data: A, B, C, A+B, A+C, B+C, A+B+C. I do not want to have any columns repeated in any combination, e.g. A+A+B+C or A+B+B+C.

I would also like each column in the dataframe to be labelled with the relevant variable names (e.g. for the combination A + B, the column name should be 'A_B').

This is the desired DataFrame:

   A  B  C  A_B  A_C  B_C  A_B_C
0  1  3  3    4    4    6      7
1  1  9  4   10    5   13     14
2  4  6  3   10    7    9     13

This is relatively easy with just 3 variables using itertools and I have used the following code to do it:

    import pandas as pd
    import itertools

    combos_2 = pd.DataFrame({
        '{}_{}'.format(a, b): df[a] + df[b]
        for a, b in itertools.combinations(df.columns, 2)
    })

    combos_3 = pd.DataFrame({
        '{}_{}_{}'.format(a, b, c): df[a] + df[b] + df[c]
        for a, b, c in itertools.combinations(df.columns, 3)
    })

    composites = pd.concat([df, combos_2, combos_3], axis=1)

However, I can't figure out how to extend this code in a pythonic way to handle a DataFrame with a much larger number of columns. Is there a way to make the code above more pythonic and extend it to a large number of columns? Or is there a more efficient way of generating the combinations?

Upvotes: 2

Views: 3017

Answers (2)

Eugene Pakhomov

Reputation: 10632

You were pretty close:

from itertools import chain, combinations

# Need to realize the generator to make sure that we don't
# read columns from the altered dataframe.
combs = list(chain.from_iterable(combinations(df.columns, i)
                                 for i in range(2, len(df.columns) + 1)))
for cols in combs:
    df['_'.join(cols)] = df.loc[:, cols].sum(axis=1)

A word of caution: if you join column names with _ while the column names themselves can contain _, you're bound to have column name clashes sooner or later.
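To make the clash concrete, here is a minimal sketch that lists the generated names appearing more than once for a given set of column names. The helpers `combo_names` and `clashing_names` are hypothetical names made up for this illustration, not part of pandas or itertools, and the `+` separator is just one possible alternative.

```python
from collections import Counter
from itertools import chain, combinations


def combo_names(columns, sep="_"):
    """Return the joined name of every non-empty combination of columns."""
    columns = list(columns)
    combos = chain.from_iterable(
        combinations(columns, r) for r in range(1, len(columns) + 1)
    )
    return [sep.join(map(str, combo)) for combo in combos]


def clashing_names(columns, sep="_"):
    """Return the generated names that occur more than once, sorted."""
    counts = Counter(combo_names(columns, sep))
    return sorted(name for name, n in counts.items() if n > 1)


# The existing column 'A_B' collides with the combination of 'A' and 'B'.
clashes = clashing_names(["A", "B", "A_B"])
# A separator that never occurs in the column names avoids the collision.
no_clashes = clashing_names(["A", "B", "A_B"], sep="+")
```

Running a check like this before assigning the new columns would show whether an existing column is about to be overwritten.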

Upvotes: 1

BENY

Reputation: 323226

First build the combinations of the column names, then create the dataframe from them:

from itertools import combinations
import pandas as pd
# Use names that do not shadow the built-in input().
cols = df.columns
combos = sum([list(map(list, combinations(cols, i))) for i in range(len(cols) + 1)], [])
combos
Out[21]: [[], ['A'], ['B'], ['C'], ['A', 'B'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C']]
# Skip the empty combination and sum the selected columns row-wise.
df1 = pd.DataFrame({'_'.join(x): df[x].sum(axis=1) for x in combos if x != []})
df1
Out[22]: 
   A  B  C  A_B  A_C  B_C  A_B_C
0  1  3  3    4    4    6      7
1  1  9  4   10    5   13     14
2  4  6  3   10    7    9     13
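
For a frame with many columns, the same idea can be wrapped in a function. The following is a sketch only: `column_sum_combinations`, `max_size` and `sep` are hypothetical names introduced here. Bear in mind that n columns produce 2^n - 1 non-empty combinations (20 columns already give 1,048,575), so an optional cap on the combination size is probably needed in practice.

```python
from itertools import chain, combinations

import pandas as pd


def column_sum_combinations(df, max_size=None, sep="_"):
    """Return a new frame with the row-wise sum of every column combination.

    max_size limits how many columns go into one combination
    (defaults to all columns).
    """
    cols = list(df.columns)
    if max_size is None:
        max_size = len(cols)
    combos = chain.from_iterable(
        combinations(cols, r) for r in range(1, max_size + 1)
    )
    # Build all columns in one go instead of inserting them one at a time.
    return pd.DataFrame(
        {sep.join(map(str, combo)): df[list(combo)].sum(axis=1) for combo in combos}
    )


df = pd.DataFrame({"A": [1, 1, 4], "B": [3, 9, 6], "C": [3, 4, 3]})
result = column_sum_combinations(df)
pairs_only = column_sum_combinations(df, max_size=2)
```

With `max_size=2` only the single columns and the pairs are produced, which keeps the output manageable when the number of columns grows.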

Upvotes: 3
