Count matching combinations in a pandas dataframe

Question

I need to find a more efficient solution for the following problem:

Given is a dataframe with 4 variables in each row. I need to find the list of 8 elements that includes all the variables per row in a maximum amount of rows.

A working, but very slow, solution is to create a second dataframe containing all possible combinations (basically a permutation without repetation). Then loop through every combination and compare it wit the inital dataframe. The amount of solutions is counted and added to the second dataframe.

import numpy as np
import pandas as pd
from itertools import combinations


df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)
listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))
possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns = ['M','N','O','P','Q','R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = ''
for x, row in dfcombi.iterrows():
        comparelist =  row['List'].split(',')
        pointercounter = df.index[(df['A'].isin(comparelist) == True) & (df['B'].isin(comparelist) == True) & (df['C'].isin(comparelist) == True) & (df['D'].isin(comparelist) == True)].tolist()
        row['Count'] = len(pointercounter)

I assume there must be a way to avoid the for - loop and replace it with some pointer, i just can not figure out how.

Thanks!

Quang Hoang · Accepted Answer

Your code can be rewritten as:

# working with integers are much better than strings
enums, codes = df.stack().factorize()

# encodings of df
s = [set(x) for x in enums.reshape(-1,4)]

# possible combinations
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(codes)), 6)])

# count the combination with issubset
ret = [0]*len(possiblecombinations)
for a, (i,b) in product(s, enumerate(possiblecombinations)):
    ret[i] += a.issubset(b)

# the combination with maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# in code {0, 3, 4, 5, 17, 18}

# and in values: 
codes[list(max_combination)]
# Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')

All that took about 2 seconds as oppose to your code that took around 1.5 mins.

Count matching combinations in a pandas dataframe

Answers (1)

Related Questions