All unique column combinations in a dataframe

Question

Thanks for reading.

I'm trying to create all possible unique combinations of columns in a dataframe. Given columns A, B, C and D, the combinations would be AB, AC, AD, BC, BD, ABC, ABD, and so on.

A   B   C   D   AB   AC   AD ...
1   1   3   2   2    4    3

To accomplish this, I created a for loop:

for i, comb in enumerate(df_p.columns):
    for comb2 in df_p.columns[i:]:
        if comb != comb2 and comb not in comb2:
            df_p[comb + ' + ' + comb2] = df_p[comb].astype('str') + ' + ' + df_p[comb2].astype("str")
            print(" comb: " + comb + " combines with comb2: " + comb2)

Basically, the outer iterator "comb" starts at the first column (A) and the inner iterator "comb2" starts at the second column (B), creating AB, and continues until all combinations with A are created. Then, when comb moves to B, comb2 starts at C, and so on. The if conditions prevent things like A + A as well as A + BA (errors I ran into when testing this with a few more columns in the df).

My problem now is reversed duplicates: "ABD" is created when the first iterator is at A (and the second iterator combines it with all the other columns), and "DBA" is created as well when the first iterator is at D and the second iterator runs through all its combinations.
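One plausible cause of these duplicates, assuming pandas builds a new column index each time a column is added: `df_p.columns[i:]` is evaluated again on every pass of the outer loop, so it also returns the combined columns created on earlier passes. A minimal sketch of the pairwise case that snapshots the original names first is shown here; `base_cols` and the sample values are illustrative only, not part of the original post.

```python
import pandas as pd

# Sample data mirroring the example above, stored as strings
df_p = pd.DataFrame({"A": ["1"], "B": ["1"], "C": ["3"], "D": ["2"]})

# Snapshot the original column names so that columns added inside the
# loop are never picked up as inputs for further combinations.
base_cols = list(df_p.columns)

for i, comb in enumerate(base_cols):
    # Start one position after comb, so A + A and reversed pairs cannot occur
    for comb2 in base_cols[i + 1:]:
        df_p[comb + " + " + comb2] = df_p[comb] + " + " + df_p[comb2]

print(list(df_p.columns))
```

Slicing from `i + 1` makes both if conditions unnecessary for pairs, because each column is only ever paired with the columns that come after it.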

In my research I also tried itertools.combinations, like this: set(itertools.combinations(df_p.columns, 2)) for combinations of 2, and so forth for every other combination size. But then I had trouble "mapping" the newly created column combinations (like AB) to the row values of my original df (for this example, the row values of A + the row values of B).

I prefer the itertools option, as it allows more control over how many combinations we want, and the mapping is probably not that hard. Any thoughts?

Thanks in advance.

----------------------------------UPDATE-----------------------------------------

To clarify: I forgot to mention that the row values are strings. Here is a snippet of the real columns:

retired     nationality     region
   1         Portugal       Lisbon

So creating all combinations of just these 3 for example would be:

retired  nationality  region  retired + nationality   retired + region   (..)
   1      Portugal    Lisbon      1 + Portugal           1 + Lisbon

ansev · Accepted Answer

If I understand correctly, you can use itertools.combinations together with functools.reduce and Series.add:

from itertools import combinations
from functools import reduce

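# Copy the original columns so newly added ones are not combined again
cols = df.columns.copy()
for i in range(2, len(cols) + 1):
    for names in combinations(cols, i):
        # Add the selected columns one by one, starting from the first
        df[''.join(names)] = reduce(
            lambda cum_serie, new_serie_name: cum_serie.add(df[new_serie_name]),
            names[1:],
            df[names[0]],
        )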


print(df)

Output

   A  B  C  D  AB  AC  AD  BC  BD  CD  ABC  ABD  ACD  BCD  ABCD
0  1  1  3  2   2   4   3   4   3   5    5    4    6    6     7
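As a sanity check on the output above, the number of added columns can be counted without building a dataframe. For n original columns the loop adds one column per combination of size 2 to n, which is 2^n - n - 1 in total, so the result grows quickly as columns are added. This is only a sketch; the variable names are illustrative.

```python
from itertools import combinations
from math import comb  # available in Python 3.8+

cols = ["A", "B", "C", "D"]

# Names generated the same way as in the loop above
names = ["".join(c) for r in range(2, len(cols) + 1) for c in combinations(cols, r)]

# Sum of binomial coefficients C(n, r) for r = 2..n, equal to 2**n - n - 1
expected = sum(comb(len(cols), r) for r in range(2, len(cols) + 1))

print(len(names), expected)

# combinations() keeps the input order, so no name appears in reversed form
assert "BA" not in names and "DBA" not in names
```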

EDIT: for string values, concatenate with Series.str.cat instead of Series.add

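# Cast column names and values to strings so they can be concatenated
df = df.rename(columns=str).astype(str)
cols = df.columns.copy()
for i in range(2, len(cols) + 1):
    for names in combinations(cols, i):
        # Concatenate the selected columns row-wise, separated by ' + '
        df[' + '.join(names)] = reduce(
            lambda cum_serie, new_serie_name: cum_serie.str.cat(df[new_serie_name], sep=' + '),
            names[1:],
            df[names[0]],
        )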
print(df)

   A  B  C  D  A + B  A + C  A + D  B + C  B + D  C + D  A + B + C  A + B + D  \
0  1  1  3  2  1 + 1  1 + 3  1 + 2  1 + 3  1 + 2  3 + 2  1 + 1 + 3  1 + 1 + 2   

   A + C + D  B + C + D  A + B + C + D  
0  1 + 3 + 2  1 + 3 + 2  1 + 1 + 3 + 2
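If you want to cap how many columns go into a combination, the same idea can be wrapped in a small helper. This is only a sketch: `add_column_combinations`, `max_size` and `sep` are hypothetical names, not part of pandas. It joins the row values in plain Python rather than with reduce, and it collects the new columns in a dict and attaches them with a single pd.concat instead of inserting them one at a time.

```python
from itertools import combinations

import pandas as pd


def add_column_combinations(df, max_size=None, sep=" + "):
    """Return a copy of df plus one joined string column per column combination."""
    base = df.rename(columns=str).astype(str)
    cols = list(base.columns)
    if max_size is None:
        max_size = len(cols)

    new_cols = {}
    for size in range(2, max_size + 1):
        for names in combinations(cols, size):
            # Walk the selected columns row by row and join their values
            rows = zip(*(base[name] for name in names))
            new_cols[sep.join(names)] = [sep.join(row) for row in rows]

    # Attach all new columns in one step
    return pd.concat([base, pd.DataFrame(new_cols, index=base.index)], axis=1)


# Sample row taken from the question's update
df = pd.DataFrame({"retired": [1], "nationality": ["Portugal"], "region": ["Lisbon"]})
out = add_column_combinations(df, max_size=2)
print(out.columns.tolist())
```

With max_size=2 only the pairwise columns are created; leaving it at None builds every combination size, as in the loops above.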
