Reputation: 11657
I am trying to write some code implementing a hierarchical Bayesian model (HBM) for logistic regression on the Adult dataset from the UCI repository.
I have already written the code, but sampling is extremely slow, on the order of 107 s per sample, even for just 64 dimensions/features. Am I doing something wrong?
I am attaching the code for reference. Following earlier suggestions, I also rescaled the data to try to speed things up, to no avail.
I would appreciate any feedback.
The code is a mixture of what has been written here and here.
import numpy as np
import pandas as pd
import pymc3 as pm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

rs = 42  # random seed for the train/test split (any fixed value)

# Reload the dataset; this time NativeCountry is kept as a single categorical
# column (the grouping variable for the hierarchy) instead of being one-hot encoded
adult_df = pd.read_csv('adult.data', header=None, sep=', ', engine='python')
adult_df.columns = ["Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
                    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
                    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"]
adult_df["Income"] = adult_df["Income"].map({"<=50K": 0, ">50K": 1})
adult_df.drop("CapitalGain", axis=1, inplace=True)
adult_df.drop("CapitalLoss", axis=1, inplace=True)
adult_df.Age = adult_df.Age.astype(float)
adult_df.fnlwgt = adult_df.fnlwgt.astype(float)
adult_df.EducationNum = adult_df.EducationNum.astype(float)
adult_df.HoursPerWeek = adult_df.HoursPerWeek.astype(float)
# NativeCountry is deliberately left out of the one-hot encoding here
adult_df = pd.get_dummies(adult_df, columns=[
    "WorkClass", "Education", "MaritalStatus", "Occupation", "Relationship",
    "Race", "Gender",
])
standard_scaler_cols = ["Age", "fnlwgt", "EducationNum", "HoursPerWeek"]
other_cols = list(set(adult_df.columns) - set(standard_scaler_cols))
mapper = DataFrameMapper(
    [([col], StandardScaler()) for col in standard_scaler_cols] +
    [(col, None) for col in other_cols]
)
# Encode NativeCountry as integer group indices for the hierarchy
le = preprocessing.LabelEncoder()
country_idx = le.fit_transform(adult_df['NativeCountry'])
y_all = adult_df["Income"].values
pd.value_counts(pd.Series(y_all))  # sanity check on class balance
adult_df.drop("Income", axis=1, inplace=True)
adult_df.drop("NativeCountry", axis=1, inplace=True)
n_countries = len(set(country_idx))
n_features = len(adult_df.columns)
min_max_scaler = preprocessing.MinMaxScaler()
adult_df = min_max_scaler.fit_transform(adult_df)
X_train, X_test, y_train, y_test, country_idx_train, country_idx_test = train_test_split(
    adult_df, y_all, country_idx,
    train_size=0.1, test_size=0.25, stratify=y_all, random_state=rs)
with pm.Model() as multilevel_model:
    # Hyperpriors for the per-country coefficient vectors
    mu_theta = pm.MvNormal('mu_theta', mu=np.zeros(n_features),
                           cov=np.eye(n_features), shape=n_features)
    packed_L_theta = pm.LKJCholeskyCov('packed_L', n=n_features,
                                       eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_theta = pm.expand_packed_triangular(n_features, packed_L_theta)
    theta = pm.MvNormal('theta', mu=mu_theta, chol=L_theta,
                        shape=(n_countries, n_features))

    # Hyperpriors for the per-country intercepts
    mu_b = pm.StudentT('mu_b', nu=3, mu=0., sd=1.0)
    sigma_b = pm.HalfNormal('sigma_b', sd=1.0)
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_countries, 1))

    # Linear predictor: each observation uses its own country's intercept and
    # coefficient vector (row-wise product and sum, not a full matrix product)
    yhat = pm.math.invlogit(
        b[country_idx_train].flatten()
        + pm.math.sum(theta[country_idx_train] * np.asarray(X_train), axis=1))

    # Likelihood: Binomial with n=1 trials, i.e. a Bernoulli
    y = pm.Binomial('y', n=np.ones(y_train.shape[0]), p=yhat, observed=y_train)
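The snippet stops at the model definition; the sampling call itself is not shown. For reference, it would be invoked along these lines (a minimal sketch; the draw and tune counts here are placeholders, not values from the original post):

with multilevel_model:
    # NUTS sampling; this is the step whose per-sample time is being measured
    trace = pm.sample(draws=1000, tune=1000)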
Upvotes: 0
Views: 356
Reputation: 1316
You will probably have more success with PyMC3 questions on our Discourse: https://discourse.pymc.io/. I invite you to move your question there.
The first thing I would check is whether your Theano is compiling against the MKL libraries, or maybe even falling back to Python mode. If you installed things via conda, that should give you MKL; if you're using pip it might be more difficult. http://deeplearning.net/software/theano/troubleshooting.html#test-blas
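A minimal way to run that check, assuming a standard Theano install (look for mkl_info or openblas_info sections in the NumPy output):

import numpy as np
import theano

# Show which BLAS NumPy was linked against
np.show_config()

# Theano's BLAS link flags; an empty value suggests Theano may be
# falling back to its slow default implementation
print(theano.config.blas.ldflags)

# Theano also ships a BLAS benchmark script, described at the link above:
#   python <theano-install-dir>/misc/check_blas.py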
Upvotes: 2