Incognite

Reputation: 41

Incorrect answer from Monte Carlo integration

How can I approximate the integral of [x^4 * sin(x)] / [exp(1)^(x/5)] from 0 to +Inf with the Monte Carlo method in R?

What I tried to do is

set.seed(666)
func1 <- function(x) {
  (x^4 * sin(x)) / exp(1)^(x/5)
}

n <- 1000000

x <- rexp(n, 0.2)
f <- func1(x)
E <- mean(f)

but the resulting E is not right.

Upvotes: 3

Views: 951

Answers (2)

Zheyuan Li

Reputation: 73295

In the following I deliberately don't set a random seed.


As I mentioned in my comments, there are at least two introductory Q & A on Monte Carlo integration on Stack Overflow:

Both explain how to get the Monte Carlo estimate, but neither covers the standard error of the estimate. It turns out that Monte Carlo estimation has an extremely slow convergence rate on your function.

It is commonly known that Monte Carlo integration has an O(1 / sqrt(N)) convergence rate, where N is the sample size and O() is big O notation. However, the constant hidden in that big O can be very large for some functions, so the convergence rate seen in practice may be much, much slower.
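
To see that O(1 / sqrt(N)) rate in isolation, here is a toy sketch on a simple integrand, the integral of x^2 on [0, 1] (true value 1/3); it is purely illustrative and unrelated to your function:

for (N in c(1e2, 1e4, 1e6)) {
  x <- runif(N)             # uniform samples on [0, 1]
  est <- mean(x^2)          # Monte Carlo estimate of the integral; true value is 1/3
  se  <- sd(x^2) / sqrt(N)  # standard error shrinks like 1 / sqrt(N)
  cat(sprintf("N = %.0e  estimate = %.5f  se = %.5f\n", N, est, se))
}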

Your function can be defined in at least two ways:

## direct definition
f <- function (x) x^4 * sin(x) * exp(-x/5)

## using gamma distribution; see ?rgamma
g <- function (x) sin(x) * 5 ^ 5 * gamma(5) * dgamma(x, 5, 1/5)

curve(f, from = 0, to = 100)
curve(g, add = TRUE, col = 2)

The 1st Q & A explained how to compute a Monte Carlo integral using uniformly distributed samples. Your function f (or g) is almost zero beyond x = 200, so the integration on [0, +Inf) is effectively an integration on [0, 200]. The following function returns both the integral estimate and its standard error:

MCI1 <- function (n) {
  x <- runif(n, 0, 200)
  y <- 200 * f(x)
  c(mean.default(y), sqrt(var(y) / n))
}

Another way is via importance sampling, as explained in the 2nd Q & A. Here the gamma distribution is used as the proposal distribution (as Ben Bolker suggested).

MCI2 <- function (n) {
  x <- rgamma(n, 5, 0.2)
  y <- sin(x) * 75000  ## 75000 = 5^5 * gamma(5)
  c(mean.default(y), sqrt(var(y) / n))
}

Now let's check the convergence rate.

n <- seq(1000, by = 5000, length = 100)
tail(n)
#[1] 471000 476000 481000 486000 491000 496000

b1 <- sapply(n, MCI1)
b2 <- sapply(n, MCI2)

For uniform sampling, we have

par(mfrow = c(1, 2))
plot(b1[1, ], main = "estimate")
plot(b1[2, ], main = "standard error")

b1[, (ncol(b1) - 5):ncol(b1)]
#         [,1]     [,2]      [,3]      [,4]      [,5]      [,6]
#[1,] 115.1243 239.9631  55.57149 -325.8631 -140.3745  78.61126
#[2,] 181.0025 179.9988 178.99367  178.2152  177.2193 175.31446

For gamma sampling, we have

par(mfrow = c(1, 2))
plot(b2[1, ], main = "estimate")
plot(b2[2, ], main = "standard error")

b2[, (ncol(b2) - 5):ncol(b2)]
#           [,1]       [,2]     [,3]      [,4]      [,5]      [,6]
#[1,] -100.70344 -150.71536 24.40841 -49.58032 169.85385 122.81731
#[2,]   77.22445   76.85013 76.53198  76.03692  75.69819  75.25755

Whichever method is used, note how big the standard error is (compared with the estimate itself), and how slowly it shrinks.

It is much easier to use numerical integration (not surprising for a univariate function):

integrate(f, 0, 200)
#11.99356 with absolute error < 0.0012

## trapezoidal rule
200 * mean.default(f(seq(0, 200, length = 10000)))
#[1] 11.99236

With the trapezoidal rule, even if only 1e+4 evenly spaced sampling points are taken, the integral is close enough to the truth.
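
If you want a literal composite trapezoidal rule rather than the plain mean above, a minimal sketch could look like this (the helper trap is just illustrative, not part of base R):

trap <- function (f, a, b, n) {
  x <- seq(a, b, length.out = n)       # n evenly spaced nodes
  h <- (b - a) / (n - 1)               # node spacing
  h * (sum(f(x)) - (f(a) + f(b)) / 2)  # weight 1/2 at the end points, 1 inside
}
trap(f, 0, 200, 10000)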


Remark

Monte Carlo integration struggles less if we integrate on a more restricted domain. From the plot of f or g, we see that this is an oscillating function, and it actually crosses the x-axis with a period of pi. Let's consider an integration on [lower, upper].

MCI3 <- function (n, lower, upper) {
  x <- runif(n, lower, upper)
  y <- (upper - lower) * f(x)
  c(mean.default(y), sqrt(var(y) / n))
}

a1 <- sapply(n, MCI3, lower = 0, upper = pi)
a2 <- sapply(n, MCI3, lower = pi, upper = 2 * pi)
a3 <- sapply(n, MCI3, lower = 2 * pi, upper = 3 * pi)
a4 <- sapply(n, MCI3, lower = 3 * pi, upper = 4 * pi)

a1[, (ncol(a1) - 5):ncol(a1)]
#            [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
#[1,] 17.04658711 16.97935808 17.01094302 17.02117843 16.96935285 16.99552898
#[2,]  0.02407643  0.02390894  0.02379678  0.02368683  0.02354298  0.02342799

a2[, (ncol(a2) - 5):ncol(a2)]
#             [,1]         [,2]         [,3]         [,4]         [,5]
#[1,] -406.5646843 -404.9633321 -405.4300941 -405.4799659 -405.8337416
#[2,]    0.3476975    0.3463621    0.3442497    0.3425202    0.3409073
#             [,6]
#[1,] -405.8628741
#[2,]    0.3390045

a3[, (ncol(a3) - 5):ncol(a3)]
#            [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
#[1,] 1591.539911 1592.280780 1594.307951 1591.375340 1593.171500 1591.648529
#[2,]    1.197469    1.190251    1.183095    1.177079    1.172049    1.165667

a4[, (ncol(a4) - 5):ncol(a4)]
#             [,1]         [,2]         [,3]         [,4]         [,5]
#[1,] -3235.561677 -3239.147235 -3241.532097 -3238.421556 -3238.667702
#[2,]     2.336684     2.321283     2.311647     2.300856     2.286624
#             [,6]
#[1,] -3237.043068
#[2,]     2.279032

Upvotes: 1

Severin Pappadeux

Reputation: 20080

If you're going to sample from the exponential distribution, the exponential factor shouldn't appear again in the function.

From the code

set.seed(32345)

func <- function(x) { (x^4 * sin(x)) }

n <- 10000000

x <- rexp(n, 0.2)
f <- func(x)
E <- mean(f)

I'm getting the answer

[1] 13.06643

UPDATE

It fluctuates, and fluctuates badly.

Let's first start with the right answer, which according to Mathematica is equal to 4453125/371293 ≈ 11.9936.

I transformed the integral from

I = ∫ exp(-x/5) x^4 sin(x) dx

using the substitution y = x/5 into

I = 5^5 Γ(5) ∫ [exp(-y) y^(5-1) / Γ(5)] sin(5*y) dy

Everything but sin(5*y) is the normalized gamma density, which we will use to sample, and sin(5*y) will be the function whose mean value we compute.

I used the following trick together with a large number of samples: I split the calculation into positive and negative values, which helps when you have a fluctuating answer with values canceling each other. I also did the calculation in batches. The gamma function of 5 is just 4! (factorial).

Code

set.seed(32345)

N  <- 10000000 # number of samples per batch
NN <- 640      # number of batches

pos <- rep(0, NN) # positive values
neg <- rep(0, NN) # negative values

for(k in 1:NN) { # loop over batches
    y <- rgamma(N, shape=5, scale=1)
    f <- sin(5.0 * y)
    pnf <- ifelse(f > 0.0, f, 0.0)
    pos[k] <- mean(pnf)
    pnf <- ifelse(f < 0.0, -f, 0.0)
    neg[k] <- mean(pnf)
    print(k)
}

mean(pos)
sd(pos)/sqrt(NN)

mean(neg)
sd(neg)/sqrt(NN)

5*5*5*5*5*4*3*2*(mean(pos) - mean(neg))

Output

> mean(pos)
[1] 0.3183912
> sd(pos)/sqrt(NN)
[1] 4.749269e-06
> 
> mean(neg)
[1] 0.3182223
> sd(neg)/sqrt(NN)
[1] 5.087734e-06
> 
> 5*5*5*5*5*4*3*2*(mean(pos) - mean(neg))
[1] 12.67078

You can see that we are really computing the difference of two very close values, which is why it is hard to get convergence. It took a bit over 20 minutes to compute on my Xeon workstation.

And with a different seed, 12345:

> mean(pos)
[1] 0.3183917
> sd(pos)/sqrt(NN)
[1] 4.835424e-06
> 
> mean(neg)
[1] 0.3182268
> sd(neg)/sqrt(NN)
[1] 4.633129e-06
> 
> 5*5*5*5*5*4*3*2*(mean(pos) - mean(neg))
[1] 12.36735 
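
As a side note, the per-batch loop can also be written in vectorized form, which is typically faster in R. A minimal sketch of the same computation (with much smaller N and NN here just to keep the run short, so the standard error will be larger; fmat is just an illustrative name):

set.seed(32345)

N  <- 100000  # samples per batch (much smaller than above)
NN <- 64      # number of batches

fmat <- matrix(sin(5 * rgamma(N * NN, shape = 5, scale = 1)), nrow = N)
pos <- colMeans(pmax(fmat, 0))   # batch means of the positive part
neg <- colMeans(pmax(-fmat, 0))  # batch means of the negative part

75000 * (mean(pos) - mean(neg))  # 75000 = 5^5 * 4!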

Upvotes: 2
