rischan
rischan

Reputation: 1585

How to compute pairwise matrix from all pandas columns

Consider I have dataframe:

data = [[11, 10, 13], [16, 15, 45], [35, 14,9]] 
df = pd.DataFrame(data, columns = ['A', 'B', 'C']) 
df 

The data looks like:

    A   B   C
0   11  10  13
1   16  15  45
2   35  14  9

The real data consists of a hundred columns and thousand rows.

I have a function, the aim of the function is to count how many values that higher than the minimum value of another column. The function looks like this:

def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows

Example output from the function like this:

get_count_higher_than_min(df, 'A', df['B'])

The output is 3. That is because the minimum value of df['B'] is 10 and three values from df['A'] are higher than 10, so the output is 3.

The problem is I want to compute the pairwise of all columns using that function

I don't know what an effective and efficient way to solve this issue. I want the output in the form of a similar to confusion matrix or similar to correlation matrix.

Example output:

    A   B   C
A   X  3  X
B   X  X  X
C   X  X  X

Upvotes: 2

Views: 539

Answers (2)

Epsi95
Epsi95

Reputation: 9047

from itertools import product
pairs = product(df.columns, repeat=2)

min_value = {}
output = []


for each_pair in pairs:
    # making sure that we are calculating min only once
    min_ = min_value.get(each_pair[1], df[each_pair[1]].min())
    min_value[each_pair[1]] = min_
    
    count = df[df[each_pair[0]]>min_][each_pair[0]].count()
    output.append(count)
    
df_desired = pd.DataFrame(
    [output[i: i+len(df.columns)] for i in range(0, len(output), len(df.columns))], 
    columns=df.columns, index=df.columns)

print(df_desired)
   A  B  C
A  2  3  3
B  2  2  3
C  2  2  2

Upvotes: 1

orlp
orlp

Reputation: 117771

This is O(n2m) where n is the number of columns and m the number of rows.

minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})

Result:

>>> m
   A  B  C
A  2  3  3
B  2  2  3
C  2  2  2

In theory O(n log(n) m) is possible.

Upvotes: 1

Related Questions