What is the best practice to calculate global frequency of list of elements with exact orders in python within multiple pandas dataframe?

Question

Let's say I have the following datafarme df1 corresponding to user1:

+-------------------+-------+--------+-------+-------+----------+----------------+
|      Models       |  MAE  |  MSE   | RMSE  | MAPE  | R² score |  Runtime [ms]  |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression  | 4.906 | 27.784 | 5.271 | 0.405 |  -6.917  | 0:00:43.387145 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   Random Forest   | 2.739 | 10.239 |  3.2  | 0.231 |  -1.917  | 0:28:11.761681 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|      XGBoost      | 2.826 | 10.898 | 3.301 | 0.234 |  -2.105  | 0:03:58.883474 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   MLPRegressor    | 5.234 | 30.924 | 5.561 | 0.43  |  -7.812  | 0:01:44.252276 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|        SVR        | 5.061 | 29.301 | 5.413 | 0.417 |  -7.349  | 0:04:52.754769 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.454 | 8.823  | 2.97  | 0.201 |  -1.514  | 0:19:36.925169 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   LGBMRegressor   | 2.76  | 10.204 | 3.194 | 0.231 |  -1.907  | 0:04:51.223103 |
+-------------------+-------+--------+-------+-------+----------+----------------+


+-------------------+----------------------------------------------------------------------------------------------------------+
|      Rank         |                                                          MAE                                             | 
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"]  |  
+-------------------+----------------------------------------------------------------------------------------------------------+

I have following datafarme df2 corresponding to user2:

+-------------------+-------+--------+-------+-------+----------+----------------+
|      Models       |  MAE  |  MSE   | RMSE  | MAPE  | R² score |  Runtime [ms]  |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression  | 4.575 | 24.809 | 4.981 | 0.377 |  -6.079  | 0:00:45.055854 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   Random Forest   | 2.345 | 8.065  | 2.84  | 0.199 |  -1.301  | 0:10:55.468473 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|      XGBoost      | 2.129 | 7.217  | 2.686 | 0.179 |  -1.059  | 0:01:01.575033 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   MLPRegressor    | 4.414 | 23.477 | 4.845 | 0.363 |  -5.699  | 0:00:31.231719 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|        SVR        | 4.353 | 22.826 | 4.778 | 0.357 |  -5.513  | 0:02:12.258870 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.281 | 7.671  | 2.77  | 0.189 |  -1.189  | 0:08:16.526615 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   LGBMRegressor   | 2.511 |  9.18  | 3.03  | 0.212 |  -1.619  | 0:15:25.084937 |
+-------------------+-------+--------+-------+-------+----------+----------------+


+-------------------+----------------------------------------------------------------------------------------------------------+
|      Rank         |                                                          MAE                                             | 
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"]  |  
+-------------------+----------------------------------------------------------------------------------------------------------+

Let's say I have more datafarmes df1000 corresponding to user1000.

Problem statement: I want to count how often each ranking order occurs across all users (for a given metric). (And then, sort the ranking orders by their counts, and, additionally, compute the percentage of how often each particular ranking order occurs (based on the counts).)

I want to rank Models result (sorted over a specific column (e.g. MAE ) iteratively and return the frequency of top models over all dfs (df1 till df1000). so this is not something I can easily reach using the:

df["category"].value_counts()

we are interested in computing absolute\relative frequencies in final ranked table in expected output. so definitely I need to transform and add the list of sorted models' names that'd be a list of strings.

Possible transformation or aggregation stages from my understanding:

take each df and create the list of sorted model names based desired column or metric: ['model2','model7', 'model6', 'model5', 'model4', 'model3', 'model1' ]
including the name of Users in the final transformed dataframe could also be useful (however I did not mention it in the following table in the expected output.)
computing absolute\relative frequencies and return as counts and freq(%) in final table

Expected output:

+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
|      Rank         |                                                    MAE                                                  |counts  |freq(%)  |
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"]   | 70     |   65%   |
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"]   | 20     |   12%   |
| Top models(sorted)|                                             ....                                                  | ....     |   ....  |
....
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+

I also was thinking maybe I can use Natural Language Processing (NLP) methods called TF-IDF to handle this problem using:

# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

Potentially related posts I have checked:

Sachin Hosmani · Accepted Answer

Does this work?

import pandas as pd
from collections import Counter

# I've left out the other metrics because they are irrelevant to your question, but you can add them
data1 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.906, 2.739, 2.826, 5.234, 5.061, 2.454, 2.76]
}
data2 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
}
data3 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
}


df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

# add your 1000 data frames here if you want
dataframes = [df1, df2, df3]

concatenated_models_list = []

for df in dataframes:
    df_sorted = df.sort_values(by='MAE', ascending=False)
    # Concatenate the sorted-by-MAE model names into one string, so that
    # you can group by it as a key later
    concatenated_models = ','.join(df_sorted['Models'].tolist())
    concatenated_models_list.append(concatenated_models)

# Union everything into a single dataframe
union_df = pd.DataFrame(concatenated_models_list, columns=['Top models(sorted)'])

grouped_df = union_df['Top models(sorted)'].value_counts().reset_index()
grouped_df.columns = ['Top models(sorted)', 'count']

grouped_df['freq(%)'] = (grouped_df['count'] / len(dataframes)) * 100

grouped_df

Output

Top models(sorted)  count   freq(%)
LinearRegression,MLPRegressor,SVR,LGBMRegresso...   2   66.666667
MLPRegressor,SVR,LinearRegression,XGBoost,LGBM...   1   33.333333

What is the best practice to calculate global frequency of list of elements with exact orders in python within multiple pandas dataframe?

Answers (2)

Related Questions