Wes McClintick

Reputation: 507

Matching elements in a list

Just starting to program in R... Got stumped on this one, perhaps because I don't know where to begin.

Define a random variable to be equal to the number of trials needed to get a match. So if you have a list of numbers (4, 5, 7, 11, 3, 11, 12, 8, 8, 1, ...), the first value of the random variable is 6, because by the sixth number there are two 11s (4, 5, 7, 11, 3, 11). The second value is 3, because the next three numbers contain two 8s (12, 8, 8). The code below creates the list of numbers, u, by simulating from a uniform distribution.

Thank you for any help or pointers. I've included a full description of the problem I am solving below if anyone is interested (I'm trying to learn by coding up a statistics text).

set.seed(1); u = matrix(runif(1000), nrow=1000)
u[u > 0    & u <= 1/12]   <- 1
u[u > 1/12 & u <= 2/12]   <- 2
u[u > 2/12 & u <= 3/12]   <- 3
u[u > 3/12 & u <= 4/12]   <- 4
u[u > 4/12 & u <= 5/12]   <- 5
u[u > 5/12 & u <= 6/12]   <- 6
u[u > 6/12 & u <= 7/12]   <- 7
u[u > 7/12 & u <= 8/12]   <- 8
u[u > 8/12 & u <= 9/12]   <- 9
u[u > 9/12 & u <= 10/12]  <- 10
u[u > 10/12 & u <= 11/12] <- 11
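# Strict '<' on the upper bound: the 1s assigned above must not be relabelled as 12
# (runif() never returns exactly 1, so no draw is lost)
u[u > 11/12 & u < 12/12]  <- 12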
table(u); u[1:10,]

Example 2.6-3, Concepts in Probability and Stochastic Modeling, Higgins:

Suppose we were to ask people at random in which month they were born. Let the random variable X denote the number of people we would need to ask before we found two people born in the same month. The possible values for X are 2, 3, ..., 13. That is, at least two people must be asked in order to have a match and no more than 13 need to be asked. With the simplifying assumption that every month is an equally likely candidate for a response, a computer simulation was used to estimate the probability mass function of X. The simulation generated birth months until a match was found. Based on 1000 repetitions of this experiment, the following empirical distribution and sample statistics were obtained...
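
Under the example's equal-likelihood assumption, the exact distribution can also be written down for comparison with a simulation. This is only a sketch of the reasoning, assuming independent answers and counting the person who produces the match as part of X: the first k-1 answers must all differ, and the k-th must repeat one of them.

```latex
% Assumes 12 equally likely months and independent answers.
P(X = k)
  = \underbrace{\frac{12}{12}\cdot\frac{11}{12}\cdots\frac{14-k}{12}}_{\text{first } k-1 \text{ answers all differ}}
    \cdot
    \underbrace{\frac{k-1}{12}}_{k\text{-th answer repeats one of them}}
  = \frac{12!}{(13-k)!\,12^{\,k-1}}\cdot\frac{k-1}{12},
  \qquad k = 2, 3, \ldots, 13.
```

For instance, k = 2 gives 1/12, and k = 13 gives 12!/12^12, the probability that the first twelve answers are all different.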

Upvotes: 1

Views: 1277

Answers (1)

jbaums

Reputation: 27388

R has a steep initial learning curve. I don't think it's fair to assume this is your homework, and yes, it's possible to find solutions if you know what you're looking for. However, I remember it being difficult at times to research problems online simply because I didn't know what to search for (I wasn't familiar enough with the terminology).

Below is an explanation of one approach to solving the problem in R. Read the commented code and try to figure out exactly what it's doing. Still, I would recommend working through a good beginner resource. From memory, a good one for getting up and running is icebreakeR, but there are many out there...

# set the number of simulations
nsim <- 10000

# Create a matrix, with nsim columns, and fill it with something. 
#  The something with which you'll populate it is a random sample, 
#  with replacement, of month names (held in a built-in vector called
#  'month.abb'). We're telling the sample function that it should take 
#  13*nsim samples, and these will be used to fill the matrix, which 
#  has nsim columns (and hence 13 rows). We've chosen to take samples 
#  of length 13, because as your textbook states, 13 is the maximum
#  number of month names necessary for a month name to be duplicated.
mat <- matrix(sample(month.abb, 13*nsim, replace=TRUE), ncol=nsim)

# If you like, take a look at the first 10 columns
mat[, 1:10]

# We want to find the position of the first duplicated value for each column. 
#  Here's one way to do this, but it might be a bit confusing if you're just 
#  starting out. The 'apply' family of functions is very useful for
#  repeatedly applying a function to columns/rows/elements of an object.
#  Here, 'apply(mat, 2, foo)' means that for each column (2 represents columns,
#  1 would apply to rows, and 1:2 would apply to every cell), do 'foo' to that
#  column. Our function below extends this a little with a custom function. It
#  says: for each column of mat in turn, call that column 'x' and perform 
#  'match(1, duplicated(x))'. This match function will return the position
#  of the first '1' in the vector 'duplicated(x)'. The vector 'duplicated(x)'
#  is a logical (boolean) vector that indicates, for each element of x,
#  whether that element has already occurred earlier in the vector (i.e. if 
#  the month name has already occurred earlier in x, the corresponding element
#  of duplicated(x) will be TRUE (which equals 1), else it will be FALSE (0)).
#  So the match function returns the position of the first duplicated month 
#  name (well, actually the second instance of that month name). e.g. if 
#  x consists of 'Jan', 'Feb', 'Jan', 'Mar', then duplicated(x) will be 
#  FALSE, FALSE, TRUE, FALSE, and match(1, duplicated(x)) will return 3. 
#  Referring back to your textbook problem, this is x, a realisation of the 
#  random variable X.
# Because we've used the apply function, the object 'res' will end up with
#  nsim realisations of X, and these can be plotted as a histogram.
res <- apply(mat, 2, function(x) match(1, duplicated(x)))
hist(res, breaks=seq(0.5, 13.5, 1))

[Figure: Histogram of results]
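
The matrix approach above draws a fresh set of 13 months for every simulation. If you would rather work through one long vector, as in the question (where the example list gives 6 and then 3), a minimal sketch follows. The helper name match_run_lengths is made up for illustration, and sample(1:12, ...) is assumed here as a stand-in for the binned runif() values in u.

```r
# Hypothetical helper: walk along a vector of draws, and each time a value
# repeats within the current run, record the run's length and start a new run
# at the next element. A trailing run with no repeat is dropped as incomplete.
match_run_lengths <- function(x) {
  run_lengths <- integer(0)
  start <- 1
  n <- length(x)
  while (start < n) {
    # Position (within the remaining draws) of the first repeated value
    pos <- match(TRUE, duplicated(x[start:n]))
    if (is.na(pos)) break          # no further match in the remaining draws
    run_lengths <- c(run_lengths, pos)
    start <- start + pos           # the next run begins after the match
  }
  run_lengths
}

# The example list from the question: two 11s by draw 6, then two 8s after 3 more
match_run_lengths(c(4, 5, 7, 11, 3, 11, 12, 8, 8, 1))   # returns 6 3

# Assumed stand-in for u: 1000 equally likely month numbers
set.seed(1)
u <- sample(1:12, 1000, replace = TRUE)
x <- match_run_lengths(u)
table(x)
```

Calling duplicated() on everything that remains is wasteful but simple. With only 12 possible values, looking at the next 13 draws at most would be enough if speed ever mattered.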

Upvotes: 4
