Mario
Mario

Reputation: 1976

What is the best practice to calculate global frequency of list of elements with exact orders in python within multiple pandas dataframe?

Let's say I have the following datafarme df1 corresponding to user1:

+-------------------+-------+--------+-------+-------+----------+----------------+
|      Models       |  MAE  |  MSE   | RMSE  | MAPE  | R² score |  Runtime [ms]  |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression  | 4.906 | 27.784 | 5.271 | 0.405 |  -6.917  | 0:00:43.387145 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   Random Forest   | 2.739 | 10.239 |  3.2  | 0.231 |  -1.917  | 0:28:11.761681 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|      XGBoost      | 2.826 | 10.898 | 3.301 | 0.234 |  -2.105  | 0:03:58.883474 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   MLPRegressor    | 5.234 | 30.924 | 5.561 | 0.43  |  -7.812  | 0:01:44.252276 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|        SVR        | 5.061 | 29.301 | 5.413 | 0.417 |  -7.349  | 0:04:52.754769 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.454 | 8.823  | 2.97  | 0.201 |  -1.514  | 0:19:36.925169 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   LGBMRegressor   | 2.76  | 10.204 | 3.194 | 0.231 |  -1.907  | 0:04:51.223103 |
+-------------------+-------+--------+-------+-------+----------+----------------+


+-------------------+----------------------------------------------------------------------------------------------------------+
|      Rank         |                                                          MAE                                             | 
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"]  |  
+-------------------+----------------------------------------------------------------------------------------------------------+

I have following datafarme df2 corresponding to user2:

+-------------------+-------+--------+-------+-------+----------+----------------+
|      Models       |  MAE  |  MSE   | RMSE  | MAPE  | R² score |  Runtime [ms]  |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression  | 4.575 | 24.809 | 4.981 | 0.377 |  -6.079  | 0:00:45.055854 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   Random Forest   | 2.345 | 8.065  | 2.84  | 0.199 |  -1.301  | 0:10:55.468473 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|      XGBoost      | 2.129 | 7.217  | 2.686 | 0.179 |  -1.059  | 0:01:01.575033 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   MLPRegressor    | 4.414 | 23.477 | 4.845 | 0.363 |  -5.699  | 0:00:31.231719 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|        SVR        | 4.353 | 22.826 | 4.778 | 0.357 |  -5.513  | 0:02:12.258870 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.281 | 7.671  | 2.77  | 0.189 |  -1.189  | 0:08:16.526615 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   LGBMRegressor   | 2.511 |  9.18  | 3.03  | 0.212 |  -1.619  | 0:15:25.084937 |
+-------------------+-------+--------+-------+-------+----------+----------------+


+-------------------+----------------------------------------------------------------------------------------------------------+
|      Rank         |                                                          MAE                                             | 
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"]  |  
+-------------------+----------------------------------------------------------------------------------------------------------+

Let's say I have more datafarmes df1000 corresponding to user1000.

Problem statement: I want to count how often each ranking order occurs across all users (for a given metric). (And then, sort the ranking orders by their counts, and, additionally, compute the percentage of how often each particular ranking order occurs (based on the counts).)

I want to rank Models result (sorted over a specific column (e.g. MAE ) iteratively and return the frequency of top models over all dfs (df1 till df1000). so this is not something I can easily reach using the:

df["category"].value_counts()

we are interested in computing absolute\relative frequencies in final ranked table in expected output. so definitely I need to transform and add the list of sorted models' names that'd be a list of strings.

Possible transformation or aggregation stages from my understanding:

  1. take each df and create the list of sorted model names based desired column or metric: ['model2','model7', 'model6', 'model5', 'model4', 'model3', 'model1' ]
  2. including the name of Users in the final transformed dataframe could also be useful (however I did not mention it in the following table in the expected output.)
  3. computing absolute\relative frequencies and return as counts and freq(%) in final table

Expected output:

+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
|      Rank         |                                                    MAE                                                  |counts  |freq(%)  |
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"]   | 70     |   65%   |
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"]   | 20     |   12%   |
| Top models(sorted)|                                             ....                                                  | ....     |   ....  |
....
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+

I also was thinking maybe I can use Natural Language Processing (NLP) methods called TF-IDF to handle this problem using:

# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

Potentially related posts I have checked:

Upvotes: 1

Views: 158

Answers (2)

Sachin Hosmani
Sachin Hosmani

Reputation: 1762

Does this work?

import pandas as pd
from collections import Counter

# I've left out the other metrics because they are irrelevant to your question, but you can add them
data1 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.906, 2.739, 2.826, 5.234, 5.061, 2.454, 2.76]
}
data2 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
}
data3 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
}


df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

# add your 1000 data frames here if you want
dataframes = [df1, df2, df3]

concatenated_models_list = []

for df in dataframes:
    df_sorted = df.sort_values(by='MAE', ascending=False)
    # Concatenate the sorted-by-MAE model names into one string, so that
    # you can group by it as a key later
    concatenated_models = ','.join(df_sorted['Models'].tolist())
    concatenated_models_list.append(concatenated_models)

# Union everything into a single dataframe
union_df = pd.DataFrame(concatenated_models_list, columns=['Top models(sorted)'])

grouped_df = union_df['Top models(sorted)'].value_counts().reset_index()
grouped_df.columns = ['Top models(sorted)', 'count']

grouped_df['freq(%)'] = (grouped_df['count'] / len(dataframes)) * 100

grouped_df

Output

Top models(sorted)  count   freq(%)
LinearRegression,MLPRegressor,SVR,LGBMRegresso...   2   66.666667
MLPRegressor,SVR,LinearRegression,XGBoost,LGBM...   1   33.333333

Upvotes: 2

e-motta
e-motta

Reputation: 7530

IIUC, to get your proposed result you can use this approach:

  1. Sort each dataframe by the desired metric, then concatenate all the strings in column 'Models' into a single string. (list is not hashable, so you won't be able to count it as easily.)
  2. Use the concatenated string to create a new dataframe and append it to out_list.
  3. concat all the dataframes in out_list, then get count and frequency.
metric = "MAE"

in_list = [df1, df2]  # list with all the dataframes
out_list = [
    pd.DataFrame(
        [["Top models(sorted)", df.sort_values(metric)["Models"].str.cat(sep=",")]],
        columns=["Rank", metric],
    )
    for df in in_list
]

out = pd.concat(out_list).value_counts().reset_index(name="count")
out["freq(%)"] = out["count"] / out["count"].sum() * 100
                 Rank                                                MAE  count  freq(%)
0  Top models(sorted)  CatBoostRegressor,Random Forest,LGBMRegressor,...      1     50.0
1  Top models(sorted)  XGBoost,CatBoostRegressor,Random Forest,LGBMRe...      1     50.0

Although to rank the models maybe a more sensible approach would be to concatenate all the dataframes, group by Models and get the sum for all Users.

in_list = [df1, df2]  # list with all the dataframes
out = pd.concat([df.assign(user=i) for i, df in enumerate(in_list)])
out = out.groupby("Models")["MAE"].sum().sort_values()
Models
CatBoostRegressor    4.735
XGBoost              4.955
Random Forest        5.084
LGBMRegressor        5.271
SVR                  9.414
LinearRegression     9.481
MLPRegressor         9.648
Name: MAE, dtype: float64

Upvotes: 2

Related Questions