Eran

Reputation: 555

Calculating the Matthews correlation coefficient for a matrix takes too long

I would like to calculate the Matthews correlation coefficient (MCC) between two matrices A and B. I loop over the columns of A, compute the MCC between that column and each of the 2000 rows of B, and take the index of the maximum. The code is:

import numpy as np
import pandas as pd
from sklearn.metrics import matthews_corrcoef as mcc

A = pd.read_csv('A.csv', squeeze=True)
B = pd.read_csv('B.csv', squeeze=True)

ind = {}
for col in A:
   ind[col] = np.argmax(list(mcc(B.iloc[i], A[col]) for i in range(2000)))
   print(ind[col])

My problem is that it takes a really long time (about one second per column). I saw almost the same code in R run much faster (in about 5 seconds). How can this be? Can I improve my Python code?


R Code:

A <- as.matrix(read.csv(file='A.csv'))
B <- t(as.matrix(read.csv(file='B.csv', check.names = FALSE)))
library('mccr')
C <- rep(NA, ncol(A))
for (query in 1:ncol(A)) {
    mcc <- sapply(1:ncol(B), function(i) 
           mccr(A[, query], B[, i]))
    C[query] <- which.max(mcc)
}

Upvotes: 0

Views: 778

Answers (1)

litt_r

Reputation: 11

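Maybe try vectorizing it with NumPy: dot products between the binary matrices give the confusion-matrix counts for every pair at once, so there is no Python-level loop over rows.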

import numpy as np

def compute_mcc(true_labels, pred_labels):
    """Compute the Matthews correlation coefficient for all pairs of samples.

    :param true_labels: 2D binary (0/1) array (features x samples)
    :param pred_labels: 2D binary (0/1) array (features x samples)
    :return: mcc (true samples x predicted samples)
    """
    # prep inputs for confusion matrix calculations
    pred_labels_1 = pred_labels == 1
    pred_labels_0 = pred_labels == 0
    true_labels_1 = true_labels == 1
    true_labels_0 = true_labels == 0

    # dot product of binary matrices; float counts avoid integer overflow
    # in the four-way product used for the denominator below
    confusion_dot = lambda a, b: np.dot(a.T.astype(float), b.astype(float)).T
    TP = confusion_dot(pred_labels_1, true_labels_1)
    TN = confusion_dot(pred_labels_0, true_labels_0)
    FP = confusion_dot(pred_labels_1, true_labels_0)
    FN = confusion_dot(pred_labels_0, true_labels_1)

    mcc = (TP * TN) - (FP * FN)
    denom = np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

    # avoid dividing by 0 (the numerator is also 0 there, so the result is 0)
    denom[denom == 0] = 1

    return mcc / denom
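
To connect this back to the question's layout (columns of A as queries, rows of B as candidates), here is a minimal self-contained sketch of the same idea. It is not the function above but an equivalent formulation that needs only one matrix product, using the identity MCC = (n*TP - P_a*P_b) / sqrt(P_a*(n - P_a)*P_b*(n - P_b)), where P_a and P_b are the counts of ones in each vector. The name `mcc_all_pairs` and the random demo data are assumptions for illustration; with real data you would pass something like `A.values` and `B.values`, assuming both contain only 0s and 1s.

```python
import numpy as np

def mcc_all_pairs(A, B):
    """MCC between every column of A (n x p) and every row of B (m x n).

    Both inputs are assumed to be binary (0/1). Returns an array of
    shape (p, m); entry [j, i] is the MCC of A[:, j] and B[i, :].
    """
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    n = A.shape[0]

    tp = A.T @ B.T                     # joint count of ones, shape (p, m)
    a_pos = A.sum(axis=0)[:, None]     # ones per column of A, shape (p, 1)
    b_pos = B.sum(axis=1)[None, :]     # ones per row of B, shape (1, m)

    num = n * tp - a_pos * b_pos
    den = np.sqrt(a_pos * (n - a_pos) * b_pos * (n - b_pos))

    # Where the denominator is 0 (a constant vector), report an MCC of 0.
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den > 0)
    return out

# Hypothetical demo data standing in for A.csv and B.csv.
rng = np.random.default_rng(0)
A_demo = rng.integers(0, 2, size=(50, 4))   # 50 features, 4 query columns
B_demo = rng.integers(0, 2, size=(6, 50))   # 6 candidate rows, 50 features

scores = mcc_all_pairs(A_demo, B_demo)
best_rows = scores.argmax(axis=1)           # best row of B for each column of A
```

Here `best_rows[j]` plays the role of `ind[col]` in the question's loop, but all columns are handled in a single pass.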

Upvotes: 1
