Reputation: 747
I have multiple pandas DataFrames, as follows:
import pandas as pd

data1 = {'1': [4], '2': [2], '3': [6]}
baseline = pd.DataFrame(data1)
# baseline output
#    1  2  3
# 0  4  2  6

data2 = {'1': [3], '2': [5], '5': [5]}
forecast1 = pd.DataFrame(data2)
# forecast1 output
#    1  2  5
# 0  3  5  5

data3 = {'1': [2], '3': [4], '5': [5], '6': [2]}
forecast2 = pd.DataFrame(data3)
# forecast2 output
#    1  3  5  6
# 0  2  4  5  2
How can I calculate the accuracy (or a confusion matrix) of forecast1 and forecast2, each compared separately to the baseline dataframe (i.e. baseline vs forecast1 and baseline vs forecast2)?
Please also note that forecast1 and forecast2 might have some extra columns compared to the baseline dataframe, so the accuracy calculation needs to work on the columns available in both and also deal with the extra columns. Is there a way to handle such a case?
These dataframes are the result of some data cleaning I am doing, which is why some of them have a few extra columns that are not in the baseline dataframe.
I appreciate your help.
Thanks.
Upvotes: 0
Views: 10379
Reputation: 11208
print(baseline.columns)
print(forecast1.columns)
print(forecast2.columns)
Index(['1', '2', '3'], dtype='object')
Index(['1', '2', '5'], dtype='object')
Index(['1', '3', '5', '6'], dtype='object')
You can take the intersection of the columns to find out which columns are common to the baseline and the forecast, and apply accuracy_score only to those columns.
from sklearn.metrics import accuracy_score

# columns present in both the baseline and the forecast
common_columns = list(set(baseline.columns).intersection(forecast1.columns))

avg_acc = 0.0
for c in common_columns:
    c_acc = accuracy_score(baseline[c], forecast1[c])
    print(f'Column {c} acc: {c_acc}')
    avg_acc += c_acc / len(common_columns)
print(avg_acc)
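If you also want the confusion matrix the question mentions, a minimal sketch along the same lines (assuming sklearn's confusion_matrix and the common_columns list from above) would be one matrix per shared column:

from sklearn.metrics import confusion_matrix

# one confusion matrix per common column, with the baseline values as the true labels
for c in common_columns:
    cm = confusion_matrix(baseline[c], forecast1[c])
    print(f'Column {c} confusion matrix:\n{cm}')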
You can wrap this in a function that takes the baseline and a forecast and returns the average accuracy:
from sklearn.metrics import accuracy_score

def calc_acc(baseline, forecast1):
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))
    avg_acc = 0.0
    for c in common_columns:
        c_acc = accuracy_score(baseline[c], forecast1[c])
        print(f'Column {c} acc: {c_acc}')
        avg_acc += c_acc / len(common_columns)
    print(avg_acc)
    return avg_acc
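For example, a call on the sample frames above (a hypothetical usage; with one-row frames each per-column accuracy is simply 0 or 1) would look like:

# compare each forecast against the baseline separately
acc1 = calc_acc(baseline, forecast1)  # common columns: '1', '2'
acc2 = calc_acc(baseline, forecast2)  # common columns: '1', '3'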
If you want to penalize a forecast for having extra (or missing) columns relative to the baseline, enlarge the divisor:
from sklearn.metrics import accuracy_score

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))
    avg_acc = 0.0
    for c in common_columns:
        c_acc = accuracy_score(baseline[c], forecast1[c])
        print(f'Column {c} acc: {c_acc}')
        if penalize:
            # penalizes for having either more or fewer columns than the baseline;
            # change this based on your needs
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns))
            avg_acc += c_acc / div
        else:
            avg_acc += c_acc / len(common_columns)
    print(avg_acc)
    return avg_acc
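As a quick sanity check of the penalty on the sample data (a hypothetical call): forecast2 shares columns '1' and '3' with the baseline and has one column more, so the divisor becomes 2 + |4 - 3| = 3 and the best reachable average accuracy is 2/3 instead of 1.

calc_acc(baseline, forecast2)  # div = 2 + 1 = 3 for every common column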
For regression, try the mean absolute error; the lower the error, the better the prediction.
from sklearn.metrics import mean_absolute_error

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))
    avg_acc = 0.0
    for c in common_columns:
        c_acc = mean_absolute_error(baseline[c], forecast1[c])
        print(f'Column {c} mean absolute error: {c_acc}')
        if penalize:
            # penalizes for having more or fewer columns than the baseline;
            # note that for an error metric a larger divisor lowers the reported
            # error, so you may want to adapt this penalty to your needs
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns))
            avg_acc += c_acc / div
        else:
            avg_acc += c_acc / len(common_columns)
    print(avg_acc)
    return avg_acc
Roughly, the mean percentage correct is 100% minus the mean percentage error, so you can report accuracy as one minus the relative error for each value:
def perc(a_list, b_list):
    # mean "percentage correct": 1 minus the relative absolute error per value
    # (assumes the baseline values in a_list are non-zero)
    ans = 0.0
    for i in range(len(a_list)):
        ans += 1. - abs(a_list[i] - b_list[i]) / a_list[i]
    return ans / len(a_list)
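As a quick worked example on the sample frames, column '1' gives 1 - |4 - 3| / 4 = 0.75, i.e. 75% correct:

# hypothetical check: baseline['1'] is [4], forecast1['1'] is [3]
print(perc(baseline['1'], forecast1['1']))  # 0.75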
def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))
    avg_acc = 0.0
    for c in common_columns:
        c_acc = perc(baseline[c], forecast1[c])
        print(f'Column {c} mean percentage correct: {c_acc}')
        if penalize:
            # penalizes for having more or fewer columns than the baseline;
            # change this based on your needs
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns))
            avg_acc += c_acc / div
        else:
            avg_acc += c_acc / len(common_columns)
    print(avg_acc)
    return avg_acc
Upvotes: 4