Reputation: 70
I have a dataframe named hyperparam_df which looks like the following:
repo_name file_name \
0 DeepCoMP deepcomp/util/simulation.py
1 DeepCoMP deepcomp/util/constants.py
2 DeepCoMP deepcomp/util/env_setup.py
3 DeepCoMP deepcomp/util/cli.py
4 cm3 alg/networks.py
7154 flow flow/envs/multiagent/ring/accel.py
7155 flow flow/envs/multiagent/ring/wave_attenuation.py
7156 flow flow/envs/ring/accel.py
7157 flow flow/envs/ring/lane_change_accel.py
7158 flow flow/envs/ring/wave_attenuation.py
hyperparam_name
0 {'agent_str': 'multi-sep-nns', 'row': 2, 'trai...
1 {'LOG_ROUND_DIGITS': 3, 'EPSILON': 1e-16, 'FAI...
2 {'id': 1, 'log_metrics': True}
3 {'agent': 'central', 'alg': 'ppo', 'workers': ...
4 {'embed_dim': 128}
7154 {'max_accel': 1, 'max_decel': 1, 'target_veloc...
7155 {'max_accel': 1, 'max_decel': 1, 'max_speed': ...
7156 {'max_accel': 3, 'max_decel': 3, 'target_veloc...
7157 {'max_accel': 3, 'max_decel': 3, 'lane_change_...
7158 {'max_accel': 1, 'max_decel': 1, 'v0': 30, 's0...
My data consists of multiple repositories (repo_name). Each repository has multiple files (file_name), and hyperparam_name defines the hyperparameter configuration for every file.
I am trying to find the correlation between the hyperparameters, but I am unsure how to break down the hyperparam_name column into separate columns. I would also need to convert the categorical hyperparameters into numerical ones. I haven't dealt with such a scenario before, so I'm not sure how to go about this. Any suggestions or ideas on how I could do this would be appreciated!
Upvotes: 1
Views: 70
Reputation: 47
1st) Expand the hyperparam_name Column
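import pandas as pd
# Assuming hyperparam_df is your dataframe and each cell of hyperparam_name is a dict
# Reset the index so the rows line up with the 0..n-1 index that json_normalize returns
hyperparam_df = hyperparam_df.reset_index(drop=True)
expanded_df = pd.json_normalize(hyperparam_df['hyperparam_name'].tolist())
# Concatenate the expanded columns with the original dataframe
hyperparam_df = pd.concat([hyperparam_df.drop(columns=['hyperparam_name']), expanded_df], axis=1)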
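To see what the expansion produces, here is a minimal sketch on made-up data. The toy values and the to_dict helper are assumptions for illustration, not part of your dataframe. The helper only matters if the column was read back from a CSV, in which case each cell is a string rather than a dict and has to be parsed first.

```python
import ast
import pandas as pd

# Hypothetical toy data mirroring the question's layout (values are made up)
toy_df = pd.DataFrame({
    "repo_name": ["repo_a", "repo_a", "repo_b"],
    "file_name": ["a.py", "b.py", "c.py"],
    "hyperparam_name": [
        {"alg": "ppo", "workers": 4},
        {"embed_dim": 128},
        "{'alg': 'dqn', 'embed_dim': 64}",  # a cell stored as text, e.g. after a CSV round trip
    ],
})

def to_dict(cell):
    """Parse a stringified dict; pass real dicts through unchanged."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell

# One column per hyperparameter key; keys absent from a row become NaN
records = toy_df["hyperparam_name"].map(to_dict).tolist()
expanded = pd.json_normalize(records)

# toy_df has a default 0..n-1 index, so the two frames align row by row
toy_wide = pd.concat([toy_df.drop(columns=["hyperparam_name"]), expanded], axis=1)
print(toy_wide)
```

Every key that appears in any dict becomes a column, so with many repositories the result will be wide and mostly NaN.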
2nd) Handle Categorical Variables
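from sklearn.preprocessing import LabelEncoder
# repo_name and file_name are identifiers, not hyperparameters: move them to the index
# so they are neither encoded nor included in the correlation
hyperparam_df = hyperparam_df.set_index(['repo_name', 'file_name'])
# Identify categorical columns
categorical_columns = hyperparam_df.select_dtypes(include=['object']).columns
# Apply one-hot encoding or label encoding
for col in categorical_columns:
    # Cast to str so mixed value types can be encoded; missing values get their own category
    hyperparam_df[col] = hyperparam_df[col].fillna('missing').astype(str)
    if hyperparam_df[col].nunique() > 10:  # Example threshold for using label encoding
        # Label codes are arbitrary integers, so correlations on them are only indicative
        le = LabelEncoder()
        hyperparam_df[col] = le.fit_transform(hyperparam_df[col])
    else:
        hyperparam_df = pd.get_dummies(hyperparam_df, columns=[col], prefix=[col])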
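To make the difference between the two encodings concrete, here is a small sketch on a made-up alg column. It uses pandas category codes as a stand-in for LabelEncoder (for a string column both assign integer codes in sorted order), so it runs with pandas alone.

```python
import pandas as pd

# Hypothetical categorical hyperparameter (values are made up)
alg = pd.Series(["ppo", "dqn", "ppo", "a2c"], name="alg")

# Label-style encoding: one integer per category, assigned in sorted order
# (a2c -> 0, dqn -> 1, ppo -> 2). The ordering carries no real meaning.
codes = alg.astype("category").cat.codes

# One-hot encoding: one indicator column per category, no artificial ordering
one_hot = pd.get_dummies(alg, prefix="alg")

print(codes.tolist())
print(one_hot)
```

Because the integer codes imply an order that does not exist (a2c < dqn < ppo), a correlation computed on them depends on how the labels happen to sort. One-hot columns avoid that, at the cost of one extra column per category.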
3rd) Handle Missing Values
# Fill missing values with a default value, e.g., 0
hyperparam_df = hyperparam_df.fillna(0)
# Alternatively, drop rows with missing values
# hyperparam_df = hyperparam_df.dropna()
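Filling with 0 is worth a second look here, because most files will not define most hyperparameters. The sketch below (made-up numbers, column names borrowed from the question) compares corr() on data with NaN left in place against the same data after fillna(0). pandas computes each correlation from the rows where both columns are present, so leaving NaN alone is a legitimate option too.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two hyperparameters that only appear together in some files,
# and a third that only appears in other files
df = pd.DataFrame({
    "max_accel": [1.0, 3.0, 1.0, 3.0, np.nan, np.nan],
    "max_decel": [1.0, 3.0, 1.0, 3.0, np.nan, np.nan],
    "embed_dim": [np.nan, np.nan, np.nan, np.nan, 128.0, 64.0],
})

# NaN left in place: each pair uses only the rows where both values exist
pairwise = df.corr()

# NaN replaced by 0: "not defined" is treated as the value 0
filled = df.fillna(0).corr()

print(pairwise)
print(filled)
```

In the pairwise matrix, max_accel and embed_dim have no rows in common, so their correlation is NaN. After fillna(0) the same pair shows a negative correlation that comes purely from the filled zeros, which says "these two are never set in the same file" rather than anything about their values.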
4th) Compute Correlation
correlation_matrix = hyperparam_df.corr()
# Optionally, visualize the correlation matrix using a heatmap
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
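With hundreds of hyperparameter columns an annotated heatmap gets hard to read, so it can help to list the strongest pairs instead. The helper below is a sketch, not part of pandas: top_correlated_pairs is a hypothetical name, and the toy frame only exists to show the output shape.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr, n=10):
    """Return the n column pairs with the largest absolute correlation."""
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # stack() turns the matrix into a Series indexed by (column_a, column_b);
    # dropna() removes the masked cells and any undefined correlations
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value but keep the sign in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical numeric data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
    "d": [1, 0, 1, 0],
})
top = top_correlated_pairs(toy.corr(), n=2)
print(top)
```

Applied to the real data this would be top_correlated_pairs(correlation_matrix), which gives a short ranked list to inspect before plotting anything.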
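For example, the full script:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt
# Assuming hyperparam_df is your dataframe and each cell of hyperparam_name is a dict
# Reset the index so the rows line up with the 0..n-1 index that json_normalize returns
hyperparam_df = hyperparam_df.reset_index(drop=True)
expanded_df = pd.json_normalize(hyperparam_df['hyperparam_name'].tolist())
hyperparam_df = pd.concat([hyperparam_df.drop(columns=['hyperparam_name']), expanded_df], axis=1)
# repo_name and file_name are identifiers, not hyperparameters: keep them in the index
hyperparam_df = hyperparam_df.set_index(['repo_name', 'file_name'])
# Identify categorical columns
categorical_columns = hyperparam_df.select_dtypes(include=['object']).columns
# Apply one-hot encoding or label encoding
for col in categorical_columns:
    # Cast to str so mixed value types can be encoded; missing values get their own category
    hyperparam_df[col] = hyperparam_df[col].fillna('missing').astype(str)
    if hyperparam_df[col].nunique() > 10:  # Example threshold for using label encoding
        le = LabelEncoder()
        hyperparam_df[col] = le.fit_transform(hyperparam_df[col])
    else:
        hyperparam_df = pd.get_dummies(hyperparam_df, columns=[col], prefix=[col])
# Fill missing values in the remaining (numeric) columns with a default value, e.g., 0
hyperparam_df = hyperparam_df.fillna(0)
# Compute correlation matrix
correlation_matrix = hyperparam_df.corr()
# Visualize the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()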
Upvotes: 0