Brie MerryWeather

Reputation: 70

Calculate correlation on dict type variables

I have a dataframe named hyperparam_df which looks like the following:


      repo_name                                      file_name  \
0     DeepCoMP                    deepcomp/util/simulation.py   
1     DeepCoMP                     deepcomp/util/constants.py   
2     DeepCoMP                     deepcomp/util/env_setup.py   
3     DeepCoMP                           deepcomp/util/cli.py   
4          cm3                                alg/networks.py   

7154      flow             flow/envs/multiagent/ring/accel.py   
7155      flow  flow/envs/multiagent/ring/wave_attenuation.py   
7156      flow                        flow/envs/ring/accel.py   
7157      flow            flow/envs/ring/lane_change_accel.py   
7158      flow             flow/envs/ring/wave_attenuation.py   

                   hyperparam_name  
0     {'agent_str': 'multi-sep-nns', 'row': 2, 'trai...  
1     {'LOG_ROUND_DIGITS': 3, 'EPSILON': 1e-16, 'FAI...  
2                        {'id': 1, 'log_metrics': True}  
3     {'agent': 'central', 'alg': 'ppo', 'workers': ...  
4                                    {'embed_dim': 128}  

7154  {'max_accel': 1, 'max_decel': 1, 'target_veloc...  
7155  {'max_accel': 1, 'max_decel': 1, 'max_speed': ...  
7156  {'max_accel': 3, 'max_decel': 3, 'target_veloc...  
7157  {'max_accel': 3, 'max_decel': 3, 'lane_change_...  
7158  {'max_accel': 1, 'max_decel': 1, 'v0': 30, 's0...  

My data consists of multiple repositories (repo_name). Each repository has multiple files (file_name), and hyperparam_name holds the hyperparameter configuration for each file.

I am trying to find the correlation between the hyperparameters, but I am unsure how to break down the hyperparam_name column into separate columns. I would also need to convert the categorical hyperparameters into numerical ones. I haven't dealt with such a scenario before, so I'm not sure how to go about this. Any suggestions or ideas on how I could do this would be appreciated!

Upvotes: 1

Views: 70

Answers (1)

assaabriiii

Reputation: 47

1st) Expand the hyperparam_name Column

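import pandas as pd

# Assuming hyperparam_df is your dataframe and each hyperparam_name cell is a dict
# (if the cells are strings, parse them first, e.g. with ast.literal_eval)
expanded_df = pd.json_normalize(hyperparam_df['hyperparam_name'].tolist())

# json_normalize returns a fresh 0..n-1 index; reuse the original one so rows stay aligned
expanded_df.index = hyperparam_df.index

# Concatenate the expanded columns with the original dataframe
hyperparam_df = pd.concat([hyperparam_df.drop(columns=['hyperparam_name']), expanded_df], axis=1)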
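If the dataframe was loaded from a CSV, the hyperparam_name cells may be strings rather than dicts. Here is a minimal sketch of the expansion step on a small made-up frame (sample_df, to_dict and wide_df are hypothetical names, and the values are only illustrative) that parses string cells with ast.literal_eval first:

```python
import ast
import pandas as pd

# Hypothetical sample mirroring the question's layout; values are illustrative only.
sample_df = pd.DataFrame(
    {
        "repo_name": ["DeepCoMP", "cm3", "flow"],
        "file_name": ["deepcomp/util/env_setup.py", "alg/networks.py", "flow/envs/ring/accel.py"],
        "hyperparam_name": [
            "{'id': 1, 'log_metrics': True}",   # stored as a string
            "{'embed_dim': 128}",               # stored as a string
            {"max_accel": 3, "max_decel": 3},   # already a dict
        ],
    },
    index=[10, 20, 30],  # non-default index on purpose
)

def to_dict(cell):
    """Return the cell as a dict, parsing it first if it is a string."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell

parsed = sample_df["hyperparam_name"].map(to_dict)
expanded = pd.json_normalize(parsed.tolist())
expanded.index = sample_df.index  # realign with the original rows before concat

wide_df = pd.concat([sample_df.drop(columns=["hyperparam_name"]), expanded], axis=1)
print(wide_df)
```

Each hyperparameter becomes its own column, and rows that do not define a given hyperparameter get NaN there.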

2nd) Handle Categorical Variables

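from sklearn.preprocessing import LabelEncoder

# Move the identifiers to the index so they are neither encoded nor correlated
hyperparam_df = hyperparam_df.set_index(['repo_name', 'file_name'])

# Identify categorical columns
categorical_columns = hyperparam_df.select_dtypes(include=['object']).columns

# Apply one-hot encoding or label encoding
for col in categorical_columns:
    if hyperparam_df[col].nunique() > 10:  # Example threshold for using label encoding
        le = LabelEncoder()
        # astype(str) avoids the error LabelEncoder raises on a mix of strings and NaN;
        # missing values become their own 'nan' category. The integer codes are arbitrary,
        # so correlations computed on them should be read with caution.
        hyperparam_df[col] = le.fit_transform(hyperparam_df[col].astype(str))
    else:
        # dtype=int gives 0/1 indicator columns instead of booleans
        hyperparam_df = pd.get_dummies(hyperparam_df, columns=[col], prefix=[col], dtype=int)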
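To see what the two encodings produce, here is a small sketch on a made-up frame (toy and its values are hypothetical). It uses pandas category codes as a stand-in for label encoding, so it is not the exact LabelEncoder call from above:

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric hyperparameter.
toy = pd.DataFrame(
    {
        "alg": ["ppo", "dqn", "ppo", "a2c", "dqn", "ppo"],
        "workers": [4, 1, 4, 2, 1, 8],
    }
)

# Label-style encoding: categories are sorted, so a2c=0, dqn=1, ppo=2.
# The ordering is alphabetical, not meaningful.
toy["alg_code"] = toy["alg"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category.
onehot = pd.get_dummies(toy["alg"], prefix="alg", dtype=int)

encoded = pd.concat([toy[["workers", "alg_code"]], onehot], axis=1)
corr = encoded.corr()
print(corr["workers"].round(2))
```

The alg_code correlation would change if the categories were renamed, whereas each one-hot column answers a well-defined question: does using this category go together with higher or lower values of the other hyperparameter?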

3rd) Handle Missing Values

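# Fill missing values with a default value, e.g., 0
# (corr() already skips NaN pairwise, and filling with 0 treats "not set" as a real value
# of 0, which can distort the correlations; fill only if 0 is a meaningful default)
hyperparam_df = hyperparam_df.fillna(0)

# Alternatively, drop rows with missing values
# (with sparse hyperparameters this may remove nearly every row)
# hyperparam_df = hyperparam_df.dropna()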
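Because different repositories tend to define different hyperparameters, most cells will be missing, and the choice made here changes the result. A small sketch on made-up numbers (toy is hypothetical) comparing corr() on the raw data with corr() after fillna(0):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two hyperparameters set together in some files,
# and a third one that is only set in other files.
toy = pd.DataFrame(
    {
        "max_accel": [1.0, 1.0, 3.0, 3.0, np.nan, np.nan],
        "max_decel": [1.0, 1.0, 3.0, 3.0, np.nan, np.nan],
        "embed_dim": [np.nan, np.nan, np.nan, np.nan, 128.0, 256.0],
    }
)

pairwise = toy.corr()          # NaN rows are skipped separately for each column pair
filled = toy.fillna(0).corr()  # missing values are treated as real zeros

# No file sets both max_accel and embed_dim, so the pairwise result is NaN.
print(pairwise.loc["max_accel", "embed_dim"])
# After filling, the zeros create a negative correlation that is not in the data.
print(filled.loc["max_accel", "embed_dim"])
```

A NaN in the pairwise matrix is informative in itself: it says the two hyperparameters never occur together, so no correlation can be estimated.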

4th) Compute Correlation

correlation_matrix = hyperparam_df.corr()

# Optionally, visualize the correlation matrix using a heatmap
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
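With many hyperparameter columns an annotated heatmap gets hard to read, so a ranked list of the strongest pairs can be a useful complement. A minimal sketch on synthetic data (demo and the column names a, b, c are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic demo data: b is a linear function of a, c is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
demo = pd.DataFrame({"a": x, "b": 2 * x + 1, "c": rng.normal(size=50)})

corr = demo.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort the pairs by absolute correlation, strongest first.
top = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(top.head())
```

Applied to the real data, correlation_matrix would take the place of corr.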

Putting it all together:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt

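# Assuming hyperparam_df is your dataframe and each hyperparam_name cell is a dict
expanded_df = pd.json_normalize(hyperparam_df['hyperparam_name'].tolist())
expanded_df.index = hyperparam_df.index  # json_normalize resets the index; realign before concat
hyperparam_df = pd.concat([hyperparam_df.drop(columns=['hyperparam_name']), expanded_df], axis=1)

# Move the identifiers to the index so they are neither encoded nor correlated
hyperparam_df = hyperparam_df.set_index(['repo_name', 'file_name'])

# Identify categorical columns
categorical_columns = hyperparam_df.select_dtypes(include=['object']).columns

# Apply one-hot encoding or label encoding
for col in categorical_columns:
    if hyperparam_df[col].nunique() > 10:  # Example threshold for using label encoding
        le = LabelEncoder()
        # astype(str) avoids the error on mixed strings and NaN; NaN becomes a 'nan' category
        hyperparam_df[col] = le.fit_transform(hyperparam_df[col].astype(str))
    else:
        hyperparam_df = pd.get_dummies(hyperparam_df, columns=[col], prefix=[col], dtype=int)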

# Fill missing values with a default value, e.g., 0
hyperparam_df = hyperparam_df.fillna(0)

# Compute correlation matrix
correlation_matrix = hyperparam_df.corr()

# Visualize the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()

Upvotes: 0
