Cupitor
Cupitor

Reputation: 11657

Hierachical Bayesian Linear Regression using PyMC3 is super slow

I am trying to write some code for implementing HBM in the case of logistic regression using the adults dataset from the UCI repository.

I have already written the code, but sampling is super slow, on the order of 107s per sample, for even 64 dimensions or features. Am I doing something wrong?

I am attaching the code for reference. I also rescaled the data thanks to suggestions to try to speed it up, to no avail.

I appreciate any feedback.

The code is a mixture of what has been written here and here.

#re loading the dataset this time without converting the country into one-hot vector rather for hierarchical modeling
adult_df = pd.read_csv('adult.data', header=None, sep=', ', )

adult_df.columns = ["Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"]


adult_df["Income"] = adult_df["Income"].map({ "<=50K": 0, ">50K": 1 })

adult_df.drop("CapitalGain", axis=1, inplace=True,)
adult_df.drop("CapitalLoss", axis=1, inplace=True,)

adult_df.Age = adult_df.Age.astype(float)
adult_df.fnlwgt = adult_df.fnlwgt.astype(float)
adult_df.EducationNum = adult_df.EducationNum.astype(float)
adult_df.HoursPerWeek = adult_df.HoursPerWeek.astype(float)


# dropping native country here!!
adult_df = pd.get_dummies(adult_df, columns=[
    "WorkClass", "Education", "MaritalStatus", "Occupation", "Relationship",
    "Race", "Gender",
])

standard_scaler_cols = ["Age", "fnlwgt", "EducationNum", "HoursPerWeek",]
other_cols = list(set(adult_df.columns) - set(standard_scaler_cols))
mapper = DataFrameMapper(
    [([col,], StandardScaler(),) for col in standard_scaler_cols] +
    [(col, None,) for col in other_cols]
)



le = preprocessing.LabelEncoder()
country_idx = le.fit_transform(adult_df['NativeCountry'])

pd.value_counts(pd.Series(y_all))
y_all = adult_df["Income"].values
adult_df.drop("Income", axis=1, inplace=True,)

adult_df.drop("NativeCountry", axis=1, inplace=True,)
n_countries = len(set(country_idx))
n_features = len(adult_df.columns) 

min_max_scaler = preprocessing.MinMaxScaler()

adult_df = min_max_scaler.fit_transform(adult_df)

X_train, X_test, y_train, y_test, country_idx_train, country_idx_test = train_test_split(adult_df, y_all, country_idx, train_size=0.1, test_size=0.25, stratify=y_all, random_state=rs)

with pm.Model() as multilevel_model:

    # Hyperiors for intercept      
    mu_theta = pm.MvNormal(name='mu_a', mu=np.zeros(n_features), cov=np.eye(n_features), shape=n_features)

   packed_L_theta = pm.LKJCholeskyCov('packed_L', n=n_features,
                                 eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_theta = pm.expand_packed_triangular(n_features, packed_L_theta)
    theta = pm.MvNormal(mu=mu_theta, name='mu_theta', chol=L_theta, shape=[n_countries, n_features])


    # Hyperiors for intercept (Comment 1)
    mu_b = pm.StudentT('mu_b', nu=3, mu=0., sd=1.0)
    sigma_b = pm.HalfNormal('sigma_b', sd=1.0)

    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=[n_countries, 1])
    # Calculate predictions given values
    # for intercept and slope 
    yhat = pm.invlogit(b[country_idx_train] +  pm.math.dot(theta[country_idx_train], np.asarray(X_train).T))

    #Make predictions fit reality

    y = pm.Binomial('y', n=np.ones(y_train.shape[0]), p=yhat, observed=y_train)

Upvotes: 0

Views: 356

Answers (1)

twiecki
twiecki

Reputation: 1316

You will probably have more success on our discourse with pymc3 questions: https://discourse.pymc.io/ I invite you to move your question there.

The first thing I would check is if your Theano is compiling against MKL libraries, or maybe even using Python mode. If you installed things via conda, that should give you MKL, if you're using pip it might be more difficult. http://deeplearning.net/software/theano/troubleshooting.html#test-blas

Upvotes: 2

Related Questions