Reputation: 23
I have the following DataFrame:
A B C
0 1 3 3
1 1 9 4
2 4 6 3
I would like to create every possible unique combination of these columns without repetition so that I would end up with a dataframe containing the following data: A, B, C, A+B, A+C, B+C, A+B+C. I do not want to have any columns repeated in any combination, e.g. A+A+B+C or A+B+B+C.
I would also like each column in the dataframe to be labelled with the relevant variable names (e.g. for the combination A + B, the column name should be 'A_B').
This is the desired DataFrame:
A B C A_B A_C B_C A_B_C
0 1 3 3 4 4 6 7
1 1 9 4 10 5 13 14
2 4 6 3 10 7 9 13
This is relatively easy with just 3 variables using itertools and I have used the following code to do it:
import pandas as pd
import itertools
combos_2 = pd.DataFrame({'{}_{}'.format(a, b):
                         df[a] + df[b]
                         for a, b in itertools.combinations(df.columns, 2)})
combos_3 = pd.DataFrame({'{}_{}_{}'.format(a, b, c):
                         df[a] + df[b] + df[c]
                         for a, b, c in itertools.combinations(df.columns, 3)})
composites = pd.concat([df, combos_2, combos_3], axis=1)
However, I can't figure out how to extend this code in a pythonic way to handle a DataFrame with a much larger number of columns. Is there a way of making the code above more pythonic and extending it to a large number of columns? Or is there a more efficient way of generating the combinations?
Upvotes: 2
Views: 3017
Reputation: 10632
You were pretty close:
from itertools import chain, combinations
# Need to realize the generator to make sure that we don't
# read columns from the altered dataframe.
combs = list(chain.from_iterable(combinations(df.columns, i)
                                 for i in range(2, len(df.columns) + 1)))
for cols in combs:
    # cols is a tuple of column names; convert it to a list for label selection.
    df['_'.join(cols)] = df.loc[:, list(cols)].sum(axis=1)
A word of caution: if you join column names with _ while the column names themselves can contain _, you're bound to have column name clashes sooner or later.
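To make the clash concrete, here is a minimal sketch of a check you could run before building the combined columns. The helper name find_name_clashes and the example column names are made up for illustration, and it assumes the column names are strings:

```python
from itertools import chain, combinations

def find_name_clashes(columns, sep='_'):
    """Return joined names that collide with an existing column or with each other."""
    columns = list(columns)
    combos = chain.from_iterable(combinations(columns, r)
                                 for r in range(2, len(columns) + 1))
    seen = set(columns)      # names already taken by the original columns
    clashes = set()
    for combo in combos:
        name = sep.join(combo)
        if name in seen:
            clashes.add(name)
        seen.add(name)
    return sorted(clashes)

# Hypothetical columns: one of them already looks like a combined name.
print(find_name_clashes(['A', 'B', 'A_B']))           # ['A_B']
print(find_name_clashes(['A', 'B', 'A_B'], sep='+'))  # []
```

If the check returns anything, picking a separator that cannot appear in the column names (as in the second call) is one way around it.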
Upvotes: 1
Reputation: 323226
First we need to create the combinations based on the columns, then create the DataFrame:
from itertools import combinations
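cols = df.columns  # avoid shadowing the built-in input()
# Every subset of the columns, from size 0 up to len(cols), flattened into one list.
subsets = sum([list(map(list, combinations(cols, i))) for i in range(len(cols) + 1)], [])
subsets
Out[21]: [[], ['A'], ['B'], ['C'], ['A', 'B'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C']]
# Skip the empty subset and sum each remaining group of columns row-wise.
df1 = pd.DataFrame({'_'.join(x): df[x].sum(axis=1) for x in subsets if x != []})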
df1
Out[22]:
A B C A_B A_C B_C A_B_C
0 1 3 3 4 4 6 7
1 1 9 4 10 5 13 14
2 4 6 3 10 7 9 13
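For a larger number of columns, the same idea can be wrapped in a function. This is only a sketch: the name add_combination_sums and the max_size and sep parameters are my own additions, not part of pandas. Keep in mind that n columns produce 2**n - 1 - n combined columns (over a million for n = 20), so capping the combination size may be necessary in practice.

```python
import pandas as pd
from itertools import chain, combinations

def add_combination_sums(df, max_size=None, sep='_'):
    """Return df plus one row-wise sum column per combination of 2..max_size columns."""
    cols = list(df.columns)
    max_size = len(cols) if max_size is None else max_size
    combos = chain.from_iterable(combinations(cols, r)
                                 for r in range(2, max_size + 1))
    # Build all new columns first, then attach them in a single concat.
    new_cols = {sep.join(map(str, c)): df[list(c)].sum(axis=1) for c in combos}
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)

# The DataFrame from the question.
df = pd.DataFrame({'A': [1, 1, 4], 'B': [3, 9, 6], 'C': [3, 4, 3]})
result = add_combination_sums(df)
print(result)
```

Calling add_combination_sums(df, max_size=2) would keep only the pairwise sums, which is one way to keep the column count manageable.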
Upvotes: 3