Reputation: 21
I am trying to calculate the correlation between two distributions (let's call them height and weight). I know how to calculate a simple Pearson correlation using a frequentist approach: compute the covariance and then divide it by sd(x) × sd(y). The standard covariance (formula 1) is

$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$$

or, written via the correlation coefficient ρ (formula 2),

$$\operatorname{cov}(x, y) = \rho\,\sigma_x\,\sigma_y$$
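For concreteness, this is what I mean by the frequentist calculation (a toy NumPy sketch; the arrays are made up):

import numpy as np

x = np.array([170.0, 165.0, 180.0, 175.0, 160.0])  # toy "height" values
y = np.array([65.0, 60.0, 80.0, 72.0, 55.0])       # toy "weight" values

# Formula 1: sample covariance
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Rearranging formula 2: Pearson correlation = cov / (sd(x) * sd(y))
rho = cov_xy / (x.std() * y.std())

print(rho, np.corrcoef(x, y)[0, 1])  # the two values should match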
The first question I have is whether it’s appropriate to calculate the mean as in the frequentist method. Would it be correct to obtain two separate posteriors (for weight and height), then calculate covariance and correlation using the first equation?
Is it possible to treat ρ as a posterior? Should I first obtain posteriors for height and weight, then calculate ρ afterward?
My code
I'm trying to use PyMC, but in my model I can only obtain an estimate for one distribution. The prior and posterior ρ differ, but I don't understand why.
high_values = df['high'].values  # no NaN
weight_values = df["weight"].values

with pm.Model() as model:
    rho_transformed = pm.Beta("rho_transformed", alpha=2, beta=2)
    rho = pm.Deterministic("rho", rho_transformed * 2 - 1)  # Map back to [-1, 1]

    # Define means
    mu_high = pm.NegativeBinomial("mu_1", mu=mu_1, alpha=4)  # t or poisson ...
    mu_weight = pm.NegativeBinomial('mu_2', mu=mu_2, alpha=shape)  # negBin

    # Define stds
    sigma_high = pm.HalfNormal("sigma_1", sigma=1)
    sigma_weight = pm.HalfNormal("sigma_2", sigma=1)

    cov_matrix = pm.math.stack([
        [sigma_high**2, rho * sigma_high * sigma_weight],
        [rho * sigma_high * sigma_weight, sigma_weight**2]
    ])  # is it correct in bayes?

    # Observed data
    observed_data = np.column_stack([high_values, weight_values]).astype("float64")
    mu_value = pt.as_tensor_variable([mu_high, mu_weight])  # that's a very strange thing
    observed = pm.MvNormal("observed", mu=mu_value, cov=cov_matrix, observed=observed_data)  # multivariate normal distribution

    # Prior and posterior predictive checks
    prior_samples = pm.sample_prior_predictive(draws=1000)
    posterior_samples = pm.sample_posterior_predictive(trace)

    # Run the sampler
    trace = pm.sample(1000, return_inferencedata=True)  # MCMC sampling to draw samples from the posterior distribution of each parameter
You could call prior_samples.prior['rho'] and trace.posterior['rho'] to check that they differ :)
Upvotes: 2
Views: 67
Reputation: 35
Your code mostly works, but it can be simplified in a number of ways (see the code examples at the end of this post, where I also have some comments about the priors used here and about PyMC coding more generally; I got slightly carried away...).
Question 1: No, using formula 1 (and 2) directly on some set of (x, y) values would not be the way to obtain an estimate of the correlation coefficient, rho, when doing a Bayesian analysis. Formula 1 lets you compute the so-called "sample covariance" for your specific set of observed (x, y) value pairs (which are in the columns high and weight of the df dataframe). However, the typical assumption is that your finite dataset is only a sample from a larger (potentially infinite) population of values, and your aim is to estimate (with uncertainty) the true underlying population correlation based on your limited sample. To achieve that goal in a Bayesian framework, you need to make some extra assumptions about how the data was generated (how the x and y values are related), and also specify what you think about possible values of rho before seeing any data (see question 2 below).
A Bayesian analysis would not, under any circumstances, give you additional (x, y), or ("high", "weight"), value pairs to work with. (Well, ... you could in principle get so-called posterior predictive data values and estimate a distribution of rho using those, but that would be a very roundabout and indirect way to get that estimate - see question 2 and the comments to the code below.)
Question 2: Yes, a Bayesian analysis results in a (joint) posterior distribution over all your model parameters, including rho, which indicates your degree of belief in the different possible values of these parameters. The (marginal) posterior distribution over rho can be inspected, allowing you to answer questions such as "what is the posterior probability that rho > 0.7" or "what is the 95% credible interval for rho".
No: to obtain this posterior over rho, you do not need to first obtain "posteriors for height and weight" and then compute rho using formula 1 afterwards.
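As an illustration of what you can do with that marginal posterior, here is a sketch (it assumes the posterior_samples object produced by the simplified code at the end of this answer):

import numpy as np

# Flatten the MCMC samples for rho into a single 1-D array
rho_post = posterior_samples.posterior["rho"].values.flatten()

# Posterior probability that rho exceeds 0.7
print("P(rho > 0.7) =", (rho_post > 0.7).mean())

# Central 95% credible interval for rho
print("95% CI:", np.quantile(rho_post, [0.025, 0.975]))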
Bayesian analysis, brief overview
Briefly, to get an estimate of some unknown parameters using a Bayesian approach you would need to:
- specify a prior distribution for each parameter (your beliefs about plausible values before seeing the data),
- specify a likelihood (a model for how the observed data is generated given the parameters), and
- combine prior and likelihood via Bayes' theorem to obtain the posterior distribution over the parameters (in practice usually approximated by MCMC sampling).
Regarding the likelihood: Specifying the likelihood amounts to formulating a hypothesis about how the data is generated. This could be based on some mechanistic understanding of the system the data originates from, and/or on ideas about how your sample was obtained from a larger fixed population. In the present case, where we are interested in estimating a correlation coefficient for real-valued (x, y) pairs, a typical assumption would be that the (x, y) values are drawn from a bivariate normal distribution with mean vector (mu_x, mu_y) and covariance matrix Sigma, a 2 x 2 matrix that looks as shown below:
Sigma = [[ var(x),   cov(x,y) ],
         [ cov(x,y), var(y)   ]]
It turns out that cov(x,y) = rho * sd_x * sd_y (where sd_x is the standard deviation of x; this is your formula 2), so Sigma can also be described using only the parameters sd_x, sd_y, and rho:
Sigma = [[ sd_x^2,            rho * sd_x * sd_y ],
         [ rho * sd_x * sd_y, sd_y^2            ]]
When using MCMC, the outcome of such an analysis is a set of samples from the posterior, i.e., a data frame (or file) where each column corresponds to one of the model parameters (such as rho or mu_x), and each row corresponds to a sample from one step of the MCMC process. The set of values in a column constitute the empirical, marginal posterior distribution for that parameter, and you can plot histograms or compute the fraction of values larger than some threshold, etc, etc.
Getting the OP code to run: The provided code will run correctly as long as you:
- define df (specifically, this should be a dataframe with columns "high" and "weight" - for instance use the simulation code included below),
- define mu_1 and mu_2 (which are used in the priors for "mu_high" and "mu_weight"),
- define shape (which is used in the prior for "mu_weight"),
- import pymc (import pymc as pm) and pytensor (import pytensor.tensor as pt), and
- move the call to pm.sample above the call to pm.sample_posterior_predictive (so that the trace variable is defined before it is used).

You will then get the posterior distribution samples for your model parameters (mu_1, mu_2, sigma_1, sigma_2, rho, rho_transformed) in the variable trace (which is an arviz.InferenceData object).
You can get plots of the posteriors as follows:
import arviz as az
az.style.use("arviz-darkgrid")
az.plot_trace(trace, combined=True)
You can extract the posterior samples into a pandas dataframe like this:
df_post = az.extract(trace).to_dataframe()
Simplified pymc code
Below is a simplified version of the code addressing the same problem. It is divided into blocks, each of which would be well suited to run in a jupyter-lab cell. I have added comments after the code:
Import modules and simulate data:
import pymc as pm
import numpy as np
import pandas as pd
import arviz as az
import matplotlib.pyplot as plt
# Simulate the data
n = 50
mu = [170, 60]
sigma = [10, 5]
rho = 0.7
cov_matrix = np.array([[sigma[0]**2, rho * sigma[0] * sigma[1]],
[rho * sigma[0] * sigma[1], sigma[1]**2]])
samples = np.random.multivariate_normal(mu, cov_matrix, n)
df = pd.DataFrame(samples, columns=["high", "weight"])
# Plot simulated scatter data
plt.figure()
plt.scatter(df["high"], df["weight"], alpha=0.5, color="blue")
plt.title("Simulated data")
plt.xlabel("height")
plt.ylabel("weight")
plt.show()
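As a quick sanity check, the empirical sample correlation of the simulated data should land near the ground-truth rho = 0.7:

# Empirical Pearson correlation of the simulated sample
print("sample correlation:", np.corrcoef(df["high"], df["weight"])[0, 1])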
Specify and check PyMC model structure:
# Specify pymc model structure
with pm.Model() as cormodel:
    # Priors for means
    mu_x = pm.TruncatedNormal("mu_x", mu=160, sigma=30, lower=0)
    mu_y = pm.TruncatedNormal("mu_y", mu=50, sigma=20, lower=0)

    # Priors for standard deviations (sigma)
    sigma_x = pm.HalfNormal("sigma_x", sigma=10)
    sigma_y = pm.HalfNormal("sigma_y", sigma=10)

    # Prior for correlation (rho), using a bounded distribution
    rho = pm.Uniform("rho", lower=-1, upper=1)

    # Construct the 2x2 covariance matrix using the prior distributions
    cov_matrix = pm.math.stack([
        [sigma_x**2, sigma_x * sigma_y * rho],
        [sigma_x * sigma_y * rho, sigma_y**2]
    ])

    # Likelihood of the data
    y_obs = pm.MvNormal("y_obs", mu=[mu_x, mu_y], cov=cov_matrix, observed=df)

# Check the model structure is correct by plotting a graph representation
modelgraph = pm.model_to_graphviz(cormodel)
display(modelgraph)
Run MCMC to get posterior samples, and produce plots:
# Get posterior samples for model parameters
with cormodel:
    posterior_samples = pm.sample(draws=2000)
# Plot marginal posteriors and trace plots for all model parameters
az.style.use("arviz-darkgrid")
az.plot_trace(posterior_samples, combined=True)
plt.suptitle("Posterior Distributions and Trace Plots for Posterior Samples", fontsize=14, y=1.02)
plt.show()
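If you also want a numeric summary of the posteriors (means, standard deviations, HDIs, convergence diagnostics), ArviZ can tabulate those as well:

# Numeric posterior summary for selected parameters
print(az.summary(posterior_samples, var_names=["rho", "mu_x", "mu_y"]))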
Run MCMC to get prior samples, plot comparison of prior and posterior for rho:
# Get prior samples from model
with cormodel:
    prior_samples = pm.sample_prior_predictive(draws=2000)
# Extract prior and posterior samples for "rho"
prior_rho = prior_samples.prior["rho"].values.flatten()
posterior_rho = posterior_samples.posterior["rho"].values.flatten()
# ArviZ density plot for prior and posterior of rho
plt.figure()
az.plot_dist(prior_rho, color="blue", label="Prior", fill_kwargs={"alpha": 0.5})
az.plot_dist(posterior_rho, color="red", label="Posterior", fill_kwargs={"alpha": 0.5})
plt.xlabel("rho")
plt.ylabel("Density")
plt.title("Prior vs Posterior for rho")
plt.legend()
plt.show()
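As an aside, the "roundabout" posterior predictive route mentioned under Question 1 would look roughly like this (a sketch; note that it gives you a distribution of sample correlations across simulated replicate datasets, which is related to, but not the same as, the posterior of rho itself):

# Simulate replicate datasets from the fitted model
with cormodel:
    ppc = pm.sample_posterior_predictive(posterior_samples)

# Compute the sample correlation (formula 1) within each simulated dataset
sim = ppc.posterior_predictive["y_obs"].values   # shape: (chains, draws, n, 2)
sim = sim.reshape(-1, sim.shape[-2], sim.shape[-1])
rho_pp = np.array([np.corrcoef(d[:, 0], d[:, 1])[0, 1] for d in sim])

plt.figure()
az.plot_dist(rho_pp, color="green", label="Posterior predictive sample correlations")
plt.legend()
plt.show()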
Notes on code changes:
- You can pass the dataframe df directly as the observed argument (as done in the simplified code).

[Plot: prior vs posterior for rho, given the example code above. Note how the prior is uniform over the entire possible range [-1, 1], while the posterior is much narrower and centred on the ground-truth value of 0.7 (known because we simulated the data).]
Upvotes: 1