M. Regan

Reputation: 231

Why does this hierarchical Poisson model not match true params from generated data?

I am trying to fit a hierarchical Poisson regression to estimate time_delay per group and globally. I am confused as to whether pymc automatically applies a log link function to mu, or whether I have to do so explicitly:

# imports assumed by the snippet (the post refers to the library simply as pymc)
import scipy.optimize
import pymc3 as pm

with pm.Model() as model:
    alpha = pm.Gamma('alpha', alpha=1, beta=1)
    beta = pm.Gamma('beta', alpha=1, beta=1)

    a = pm.Gamma('a', alpha=alpha, beta=beta, shape=n_participants)

    mu = a[participants_idx]
    y_est = pm.Poisson('y_est', mu=mu, observed=messages['time_delay'].values)

    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.Metropolis(start=start)
    trace = pm.sample(20000, step, start=start, progressbar=True)

The traceplot below shows the estimates for a; the group estimates range between roughly 0 and 750.

[traceplot of the group estimates a]

My confusion begins when I plot the hyperparameter Gamma distribution, using the posterior means of alpha and beta as its parameters. The distribution below has support between roughly 0 and 5, which doesn't fit my expectation given the group estimates for a above. What does a represent? Is it log(a) or something else?

[plot of the Gamma density using the posterior means of alpha and beta]
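For reference, this is roughly how that density plot can be produced. Treat it as a sketch: the exact plotting code is not in the original post, and it assumes the posterior mean of beta is passed straight to scipy's scale argument.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# posterior means of the hyperparameters from the trace above
alpha_mean = trace['alpha'].mean()
beta_mean = trace['beta'].mean()

# density of a Gamma with shape alpha_mean, using beta_mean as scipy's scale
x = np.linspace(0, 10, 200)
plt.plot(x, stats.gamma.pdf(x, a=alpha_mean, scale=beta_mean))
plt.xlabel('a')
plt.ylabel('density')
plt.show()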

Thanks for any pointers.


Adding an example using fake data, as requested in the comments: this example has just a single group, so it should be easier to see whether the hyperparameter could plausibly produce the Poisson distribution of the group.

import numpy as np
import pandas as pd

test_data = []
model = []

for i in np.arange(1):
    # between 1 and 100 messages per conversation
    num_messages = int(np.random.uniform(1, 100))
    # true per-conversation mean delay, drawn from a Gamma(shape=15, scale=1)
    avg_delay = np.random.gamma(15, 1)
    for j in np.arange(num_messages):
        delay = np.random.poisson(avg_delay)
        test_data.append([i, j, delay, i])

    model.append([i, avg_delay])

model_df = pd.DataFrame(model, columns=['conversation_id', 'synthetic_mean_delay'])
test_df = pd.DataFrame(test_data, columns=['conversation_id', 'message_id', 'time_delay', 'participants_str'])
test_df.head()
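As a quick sanity check on the fake data (a sketch, not part of the original code): each delay is a Poisson draw around avg_delay, so the empirical mean of time_delay per conversation should sit close to synthetic_mean_delay.

# compare the empirical mean delay per conversation to the synthetic mean
empirical = test_df.groupby('conversation_id')['time_delay'].mean().rename('empirical_mean_delay')
print(model_df.join(empirical, on='conversation_id'))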

# Estimate parameters of the model using the test data
# convert categorical variables to integers
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
test_participants_map = le.fit(test_df['participants_str'])
test_participants_idx = le.fit_transform(test_df['participants_str'])
n_test_participants = len(test_df['participants_str'].unique())

with pm.Model() as model:
    alpha = pm.Gamma('alpha', alpha=1, beta=1)    
    beta = pm.Gamma('beta', alpha=1, beta=1)

    a = pm.Gamma('a', alpha=alpha, beta=beta, shape=n_test_participants)

    mu = a[test_participants_idx]

    y = test_df['time_delay'].values
    y_est = pm.Poisson('y_est', mu=mu, observed=y)

    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.Metropolis(start=start)
    trace = pm.sample(20000, step, start=start, progressbar=True)

[traceplot for the single-group model]

I don't see how the hyperparameter distribution below could produce a Poisson parameter between 13 and 17.

[plot of the hyperparameter Gamma distribution built from the posterior means of alpha and beta]
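To put numbers on that comparison, the trace can be summarized directly; this is a sketch, assuming trace is the MultiTrace returned by pm.sample above:

# posterior means of the group estimate and the hyperparameters
print('a     :', trace['a'].mean(axis=0))
print('alpha :', trace['alpha'].mean())
print('beta  :', trace['beta'].mean())
print('true synthetic mean delay:', model_df['synthetic_mean_delay'].values)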

Upvotes: 3

Views: 660

Answers (1)

M. Regan

Reputation: 231

ANSWER: pymc parameterizes the Gamma distribution differently from scipy and numpy. scipy (and np.random.gamma, which generated the fake data) take a shape alpha and a scale, whereas pymc takes alpha and a rate beta, with beta = 1/scale. The model below works as expected:

with pm.Model() as model:
    alpha = pm.Gamma('alpha', alpha=1, beta=1)    
    scale = pm.Gamma('scale', alpha=1, beta=1)

    a = pm.Gamma('a', alpha=alpha, beta=1.0/scale, shape=n_test_participants)  # pymc's beta is a rate, i.e. 1/scale

    #mu = T.exp(a[test_participants_idx])
    mu = a[test_participants_idx]

    y = test_df['time_delay'].values
    y_est = pm.Poisson('y_est', mu=mu, observed=y)

    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.Metropolis(start=start)
    trace = pm.sample(20000, step, start=start, progressbar=True)
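As a quick illustration of the parameterization difference (a sketch, not part of the original answer): a Gamma with shape k and scale theta has mean k*theta, while pymc's Gamma(alpha, beta) has mean alpha/beta, so the two match only when beta = 1/theta.

import numpy as np
import scipy.stats as stats

shape, scale = 15.0, 1.0

# numpy / scipy parameterization: mean = shape * scale
print(np.random.gamma(shape, scale, size=100000).mean())   # ~15
print(stats.gamma(a=shape, scale=scale).mean())            # 15.0

# pymc-style rate parameterization: mean = alpha / beta, with beta = 1 / scale
alpha, beta = shape, 1.0 / scale
print(alpha / beta)                                         # 15.0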



Upvotes: 3
