dtcitron

Reputation: 13

Fitting Binomial Distribution in R using data with varying sample sizes

I have some data that looks like this:

    x    y
1:  3    1
2:  6    1
3:  1    0
4: 31    8
5:  1    0
---

Edit: if it helps, here are sample vectors for x and y:

    x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)

    y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)

The column on the left (x) is my sample size, and the column on the right (y) is the number of successes that occur in each sample.

I would like to fit these data using a binomial distribution in order to find the probability of a success (p). All examples for fitting a binomial distribution that I've found so far assume a constant sample size (n) across all data points, but here I have varying sample sizes.

  1. How do I fit data like these, with varying sample sizes, to a binomial distribution? The desired outcome is p, the probability of observing a success in a sample of size 1.

  2. How do I accomplish a fit like this using R?

(Edit #2: The response below outlines a solution and related R code, assuming that the events observed within each sample are independent, and that the samples themselves are independent as well. This works for my data - thanks!)

Upvotes: 1

Views: 6140

Answers (2)

hanice

Reputation: 61

library(bbmle)

x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)

# Negative log-likelihood: each row contributes a binomial term with its own sample size
mf = function(prob, x, size){
  -sum(dbinom(x, size, prob, log=TRUE))
}

# Pass the successes (y) as x and the varying sample sizes (x) as size
m1 = mle2(mf, start=list(prob=0.01), data=list(x=y, size=x))
print(m1)

Coefficients:
     prob 
0.1151535 

Log-likelihood: -13.47
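
As a cross-check (my own addition, assuming the standard base-R approach applies here), the same pooled estimate can be recovered with an intercept-only binomial GLM: cbind(successes, failures) tells glm the varying sample size of each row, and plogis() maps the fitted intercept from the log-odds scale back to a probability.

# Sketch of an equivalent fit with base R's glm (intercept-only logistic regression)
fit <- glm(cbind(y, x - y) ~ 1, family = binomial)
plogis(coef(fit))  # back-transforms the intercept to the probability scale, roughly 0.115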

Upvotes: 2

niko

Reputation: 5281

What about calculating the empirical probability of success?

x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)

avr.sample <- mean(x)   # average sample size per row
avr.success <- mean(y)  # average number of successes per row

p <- avr.success/avr.sample
p
[1] 0.1151515
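
Since both means divide by the same number of rows, mean(y)/mean(x) reduces to sum(y)/sum(x), i.e. total successes over total trials, which is also the maximum-likelihood estimate for independent binomial counts:

sum(y)/sum(x)  # 19 successes out of 165 trials
[1] 0.1151515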

Or using binom.test

z <- x - y # number of failures in each row
binom.test(x = c(sum(y), sum(z)))
Exact binomial test

data:  c(sum(y), sum(z))
number of successes = 19, number of trials = 165, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.07077061 0.17397215
sample estimates:
probability of success 
             0.1151515 

However, this assumes that:

  1. The events corresponding to the rows are independent of each other
  2. The events within the same row are independent of each other as well

This means that in every iteration k of the experiment (i.e. row of x) we execute an action such as throwing x[k] identical dice (not necessarily fair dice), and a success means getting a given (predetermined) number n in 1:6.

If we suppose that the above results were obtained while trying to get a 1 when throwing x[k] dice in every iteration k, then one could say that the empirical probability of getting a 1 is (~) 0.1151515.

In the end, the distribution in question would be B(sum(x), p).

PS: In the above illustration, the dice are identical to each other not only within any given iteration but across all iterations.
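
As a quick sanity check of the B(sum(x), p) claim (an illustrative sketch of my own, not from the original answer), one can simulate each row as an independent draw from B(x[k], p) and compare the simulated totals of successes with the observed sum(y) = 19:

set.seed(1)
p.hat <- sum(y)/sum(x)
# each replicate draws one binomial count per row (with that row's sample size), then sums them
sims <- replicate(1000, sum(rbinom(length(x), size = x, prob = p.hat)))
mean(sims)                        # close to sum(x) * p.hat = 19
quantile(sims, c(0.025, 0.975))   # the observed total should fall well inside this range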

Upvotes: 2
