Jan

Reputation: 747

How to Calculate the Accuracy of Multiple Pandas DataFrames with Multiple Columns

I have multiple pandas dataframes as follows:

import pandas as pd

data1 = {'1':[4], '2':[2], '3':[6]}
baseline = pd.DataFrame(data1)

# baseline output
   1  2  3
0  4  2  6

data2 = {'1':[3], '2':[5], '5':[5]}
forecast1 = pd.DataFrame(data2)

# forecast1 output
   1  2  5
0  3  5  5

data3 = {'1':[2], '3':[4], '5':[5], '6':[2]}
forecast2 = pd.DataFrame(data3)

# forecast2 output
   1  3  5  6
0  2  4  5  2

How can I calculate the accuracy (or a confusion matrix) of forecast1 and forecast2 separately compared to the baseline dataframe (i.e. baseline vs. forecast1 and baseline vs. forecast2)?

Please also note that forecast1 and forecast2 might have some extra columns compared to the baseline dataframe, so the accuracy calculation needs to consider only the columns available in both and deal with the extra columns as well. Is there a way to handle such a case?

These dataframes are the result of data cleaning I am doing, which is why some of them have a few extra columns not present in the baseline dataframe.

I appreciate your help.

Thanks.

Upvotes: 0

Views: 10379

Answers (1)

Zabir Al Nazi Nabil

Reputation: 11208

print(baseline.columns)
print(forecast1.columns)
print(forecast2.columns)

# Index(['1', '2', '3'], dtype='object')
# Index(['1', '2', '5'], dtype='object')
# Index(['1', '3', '5', '6'], dtype='object')

You can take the intersection of the columns to find out which columns are common between the baseline and a forecast, and apply accuracy_score on just those columns.

from sklearn.metrics import accuracy_score

common_columns = list(set(baseline.columns).intersection(forecast1.columns))

avg_acc = 0.0
for c in common_columns:
    c_acc = accuracy_score(baseline[c], forecast1[c])
    print(f'Column {c} acc: {c_acc}')
    avg_acc += c_acc / len(common_columns)  # average accuracy over common columns

print(avg_acc)

You can wrap this in a function that takes the baseline and a forecast and returns the accuracy:

from sklearn.metrics import accuracy_score

def calc_acc(baseline, forecast1):
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = accuracy_score(baseline[c], forecast1[c])
        print(f'Column {c} acc: {c_acc}')
        avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc
To penalize a forecast for having extra (or missing) columns relative to the baseline, you can grow the divisor by the column-count difference:

from sklearn.metrics import accuracy_score

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = accuracy_score(baseline[c], forecast1[c])
        print(f'Column {c} acc: {c_acc}')
        if penalize:
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns)) # it will penalize for both having more or less columns than baseline, you can change it based on your needs
            avg_acc += c_acc/div
        else:
            avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc
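For illustration, here is how the penalized version might be called with the question's sample dataframes (a sketch; the function body mirrors the one above, with the print statements dropped and `penalize` made a parameter):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def calc_acc(baseline, forecast, penalize=True):
    # accuracy averaged over the columns both frames share
    common_columns = list(set(baseline.columns).intersection(forecast.columns))
    if penalize:
        # penalize having more or fewer columns than the baseline
        div = len(common_columns) + abs(len(forecast.columns) - len(baseline.columns))
    else:
        div = len(common_columns)
    avg_acc = 0.0
    for c in common_columns:
        avg_acc += accuracy_score(baseline[c], forecast[c]) / div
    return avg_acc

baseline = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
forecast1 = pd.DataFrame({'1': [3], '2': [5], '5': [5]})
forecast2 = pd.DataFrame({'1': [2], '3': [4], '5': [5], '6': [2]})

print(calc_acc(baseline, forecast1))  # 0.0 — no common-column values match
print(calc_acc(baseline, forecast2))  # 0.0 — no common-column values match
```

With this one-row sample data no value matches exactly, so classification accuracy is 0 for both forecasts; that is precisely why the error-based metrics below are more informative here.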

For regression, try mean absolute error; the lower the error, the better the prediction.

from sklearn.metrics import accuracy_score, mean_absolute_error

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = mean_absolute_error(baseline[c], forecast1[c])
        print(f'Column {c} mean absolute error: {c_acc}')
        if penalize:
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns)) # it will penalize for both having more or less columns than baseline, you can change it based on your needs
            avg_acc += c_acc/div
        else:
            avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc

Usually, mean percentage correct is approximately 100% minus the mean percentage error, so you can just subtract the relative error from 100%.

def perc(a_list, b_list):
    # mean of (1 - relative absolute error) over the elements;
    # assumes a_list contains no zeros
    ans = 0.0
    for i in range(len(a_list)):
        ans += 1.0 - abs(a_list.iloc[i] - b_list.iloc[i]) / a_list.iloc[i]
    return ans / len(a_list)

from sklearn.metrics import accuracy_score, mean_absolute_error

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = perc(baseline[c], forecast1[c])
        print(f'Column {c} mean percentage correct: {c_acc}')
        if penalize:
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns)) # it will penalize for both having more or less columns than baseline, you can change it based on your needs
            avg_acc += c_acc/div
        else:
            avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc
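A quick sanity check of perc on two of the sample columns (note that with this definition the "percentage correct" can go negative when the forecast overshoots by more than the baseline value — a sketch, using .iloc so it works on any pandas Series):

```python
import pandas as pd

def perc(a_list, b_list):
    # mean of (1 - relative absolute error); assumes a_list has no zeros
    ans = 0.0
    for i in range(len(a_list)):
        ans += 1.0 - abs(a_list.iloc[i] - b_list.iloc[i]) / a_list.iloc[i]
    return ans / len(a_list)

baseline = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
forecast1 = pd.DataFrame({'1': [3], '2': [5], '5': [5]})

print(perc(baseline['1'], forecast1['1']))  # 1 - 1/4 = 0.75
print(perc(baseline['2'], forecast1['2']))  # 1 - 3/2 = -0.5 (overshoot)
```

Because the metric divides by the baseline value, columns containing zeros would need special handling (e.g. skipping them or adding a small epsilon).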

Upvotes: 4
