How to list the most frequent combination of column that contain data

Question

Hellooo,

I am working with geological datasets which are famously messy and disparate. What I am looking to do is: output a list of column combination with the highest number of NaN-free rows for a certain number of columns.

e.g.

A B C D E F
2 6 3 7 7 3 
4 5 6 7 5 4 
3 4 x x x x 
4 5 x x x x 
6 7 x x x x
x x x 5 6 7 
x x x 7 5 8

If I input 2, then it would return a list with ['A','B'] and 5, because the A and B columns have 5 complete rows. If I input 3, return ['D','E','F'] and 4, because they have 4 complete rows. If I input 5, then I would get ['A','B','C','D','E','F'] and 2 because they have 2 full rows.

Thanks a bunch in advance!

James · Accepted Answer

It think this is what you want. Instead of returning a list of columns, this returns a list or lists of columns, to account for instances where there is a tie for the 'best' number of non-NA rows.

import pandas as pd
from itertools import combinations
from math import nan

def best_combinations(df, n_cols):
    best_cols = []
    best_length = 0
    for cols in combinations(df.columns, n_cols):
        subdf = df.loc[:, list(cols)].dropna()
        if len(subdf) > best_length:
            best_length = len(subdf)
            best_cols = [cols]
        elif (len(subdf) == best_length) and (best_length > 0):
            best_cols.append(cols)
    return best_cols, best_length

On your dataframe:

df = pd.DataFrame({
 'A': {0: '2', 1: '4', 2: '3', 3: '4', 4: '6', 5: nan, 6: nan},
 'B': {0: '6', 1: '5', 2: '4', 3: '5', 4: '7', 5: nan, 6: nan},
 'C': {0: '3', 1: '6', 2: nan, 3: nan, 4: nan, 5: nan, 6: nan},
 'D': {0: '7', 1: '7', 2: nan, 3: nan, 4: nan, 5: '5', 6: '7'},
 'E': {0: '7', 1: '5', 2: nan, 3: nan, 4: nan, 5: '6', 6: '5'},
 'F': {0: '3', 1: '4', 2: nan, 3: nan, 4: nan, 5: '7', 6: '8'}}
)

best_combinations(df, 2)
# returns:
[('A', 'B')], 5

best_combinations(df, 3)
[('D', 'E', 'F')], 4

How to list the most frequent combination of column that contain data

Answers (1)

Related Questions