Reputation: 23
Thanks for reading.
I'm trying to create all possible unique combinations of columns in a dataframe. So, having columns A, B, C and D, the combinations would be AB, AC, AD, BC, BD, CD, ABC, ABD, and so on.
A B C D AB AC AD ...
1 1 3 2 2 4 3
To accomplish this, I created a for loop:
for i, comb in enumerate(df_p.columns):
    for comb2 in df_p.columns[i:]:
        # Skip A + A and skip pairs where comb is already part of comb2
        if comb != comb2 and comb not in comb2:
            df_p[comb + ' + ' + comb2] = df_p[comb].astype('str') + ' + ' + df_p[comb2].astype('str')
            print(" comb: " + comb + " combines with comb2: " + comb2)
Basically the "comb" iterator starts in the first column (A), and the second iterator "comb2" starts the second column (B), creating AB, and moving on until all A combinations are created. Then, when comb goes to B, comb2 starts at C, and so on. The if conditions prevent things like A + A as well as A + BA (some errors I was having when testing this with a couple more columns in the df).
My problem now is regarding the reversed duplicates, like having "ABD" being created when iterator one is at letter A (and iterator two combines it with all columns) as well as "DBA" when iterator one is at D and iterator two does all combinations.
In my research I have tried using itertools combinations as well, like this: set(itertools.combinations(df_p.columns, 2))
for combinations of 2, and so forth for every other combination size, but then I had trouble "mapping" the newly created column combinations (like AB) to the row values of my original df (which would be the row values of A + the row values of B in this example).
I prefer the itertools option, as it allows for more control on how many combinations we want, and probably it is not so hard to map. Any thoughts?
Thanks in advance.
----------------------------------UPDATE-----------------------------------------
Just to clarify, I forgot to mention that the row values are strings. Here is a snippet of the real columns:
retired nationality region
1 Portugal Lisbon
So creating all combinations of just these 3 for example would be:
retired nationality region retired + nationality retired + region (..)
1 Portugal Lisbon 1 + Portugal 1 + Lisbon
Upvotes: 2
Views: 568
Reputation: 30940
IIUC, use combinations and reduce with Series.add:
from itertools import combinations
from functools import reduce

# Copy the original column names so columns added in the loop are not iterated
cols = df.columns.copy()
for i in range(2, len(cols) + 1):
    for names in combinations(cols, i):
        # Add the columns in names one by one, starting from the first
        df[''.join(names)] = reduce(
            lambda cum_serie, new_serie_name: cum_serie.add(df[new_serie_name]),
            names[1:],
            df[names[0]],
        )
print(df)
Output
A B C D AB AC AD BC BD CD ABC ABD ACD BCD ABCD
0 1 1 3 2 2 4 3 4 3 5 5 4 6 6 7
EDIT
# Make both the column names and the values strings
df = df.rename(columns=str).astype(str)
cols = df.columns.copy()
for i in range(2, len(cols) + 1):
    for names in combinations(cols, i):
        # Concatenate the columns in names with ' + ' as the separator
        df[' + '.join(names)] = reduce(
            lambda cum_serie, new_serie_name: cum_serie.str.cat(df[new_serie_name], ' + '),
            names[1:],
            df[names[0]],
        )
print(df)
A B C D A + B A + C A + D B + C B + D C + D A + B + C A + B + D \
0 1 1 3 2 1 + 1 1 + 3 1 + 2 1 + 3 1 + 2 3 + 2 1 + 1 + 3 1 + 1 + 2
A + C + D B + C + D A + B + C + D
0 1 + 3 + 2 1 + 3 + 2 1 + 1 + 3 + 2
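As a variation on the string version, here is a minimal sketch that collects the new columns in a dict and attaches them in one step with pd.concat, instead of assigning to df inside the loop. The sample dataframe is made up from the columns in the question's update, and new_cols and result are illustrative names; it assumes every value can be cast to str.

```python
from itertools import combinations

import pandas as pd

# Hypothetical sample data based on the columns shown in the question's update.
df = pd.DataFrame({"retired": [1], "nationality": ["Portugal"], "region": ["Lisbon"]})
df = df.astype(str)  # the row-wise join below needs string values

new_cols = {}
for size in range(2, len(df.columns) + 1):
    for names in combinations(df.columns, size):
        # Join the selected columns row by row with ' + ' as the separator.
        new_cols[" + ".join(names)] = df[list(names)].agg(" + ".join, axis=1)

# Attach all combined columns to the original ones at once.
result = pd.concat([df, pd.DataFrame(new_cols)], axis=1)
print(result)
```

Because combinations yields each group of columns only once, in the original column order, reversed duplicates such as "region + retired" never appear.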
Upvotes: 2
Reputation: 769
I think using combinations is the right way to go about it.
First create a list of column combinations:
col_combs = list(combinations(df.columns, 2))
And then to get a df just containing those columns for any given combination, convert the combination tuple into a list, and pass it to the dataframe.
cols = list(col_combs[0])
comb_df = df[cols]
Below is a minimal example of how to store a separate dataframe for each combination of 2 columns:
from itertools import combinations

col_combs = list(combinations(df.columns, 2))
comb_dfs = []
for cols in col_combs:
    # Copy so each sub-dataframe is independent of df
    temp = df[list(cols)].copy()
    comb_dfs.append(temp)
To handle larger combinations of columns, call combinations once for each size you want and gather all the results into one list before making the dataframes.
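One possible way to gather several combination sizes into a single list is itertools.chain.from_iterable. This is only a sketch: the sample dataframe and the sizes variable are assumptions for illustration, and the results are stored in a dict keyed by column tuple rather than in a plain list so each sub-dataframe is easy to look up.

```python
from itertools import chain, combinations

import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe.
df = pd.DataFrame({"A": [1], "B": [1], "C": [3], "D": [2]})

# Combination sizes to collect; here every size from 2 up to all columns.
sizes = range(2, len(df.columns) + 1)

# One flat list of column tuples across all requested sizes.
col_combs = list(chain.from_iterable(combinations(df.columns, r) for r in sizes))

# One independent sub-dataframe per combination, keyed by its column tuple.
comb_dfs = {cols: df[list(cols)].copy() for cols in col_combs}

print(len(col_combs))
print(comb_dfs[("A", "B", "D")])
```

Changing sizes to, say, [2, 3] restricts the result to pairs and triples.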
Upvotes: 2