Reputation: 41
How can I approximate the integral of [x^4 * sin(x)]/ [exp(1)^(x/5)] (0 to +Inf) with Monte Carlo method in R?
What I tried to do is
set.seed(666)
func1 <- function(x)
{
(x^4 * sin(x))/exp(1)^(x/5)
}
n <- 1000000
x <- rexp(n, 0.2)
f <- func1(x)
E <- mean(f)
but the result of E is not right
Upvotes: 3
Views: 951
Reputation: 73295
In the following I deliberately don't set a random seed.
As I mentioned in my comments, there are at least two introductory Q & A on Monte Carlo integration on Stack Overflow:
Both explained how to get Monte Carlo estimate, but forgot about the standard error of the estimate. It just turns out that Monte Carlo estimation has extremely slow convergence rate on your function.
It is commonly known that Monte Carlo integration has an O(1 / sqrt(N))
convergence rate, where N
is the sample size and O()
is the big O notation. However, the constant behind that big O can be very large for some functions, so the realistic convergence rate may be much much slower.
Your functions could be defined in at least two ways:
## direct definition
f <- function (x) x^4 * sin(x) * exp(-x/5)
## using gamma distribution; see ?rgamma
g <- function (x) sin(x) * 5 ^ 5 * gamma(5) * dgamma(x, 5, 1/5)
curve(f, from = 0, to = 100)
curve(g, add = TRUE, col = 2)
The 1st Q & A explained how to compute Monte Carlo integration using uniformly distributed samples. Your function f
or g
is almost zero beyond x = 200
, so integration on [0, +Inf)
is effectively on [0, 200]
. The following function would return you the integration and its standard error:
MCI1 <- function (n) {
x <- runif(n, 0, 200)
y <- 200 * f(x)
c(mean.default(y), sqrt(var(y) / n))
}
Another way is via importance sampling, as explained in the 2nd Q & A. Here gamma distribution is used as proposal distribution (as Ben Bolker suggested).
MCI2 <- function (n) {
x <- rgamma(n, 5, 0.2)
y <- sin(x) * 75000
c(mean.default(y), sqrt(var(y) / n))
}
Now let's check the convergence rate.
n <- seq(1000, by = 5000, length = 100)
tail(n)
#[1] 471000 476000 481000 486000 491000 496000
b1 <- sapply(n, MCI1)
b2 <- sapply(n, MCI2)
For uniform sampling, we have
par(mfrow = c(1, 2))
plot(b1[1, ], main = "estimate")
plot(b1[2, ], main = "standard error")
b1[, (ncol(b1) - 5):ncol(b1)]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 115.1243 239.9631 55.57149 -325.8631 -140.3745 78.61126
#[2,] 181.0025 179.9988 178.99367 178.2152 177.2193 175.31446
For gamma sampling, we have
par(mfrow = c(1, 2))
plot(b2[1, ], main = "estimate")
plot(b2[2, ], main = "standard error")
b2[, (ncol(b2) - 5):ncol(b2)]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] -100.70344 -150.71536 24.40841 -49.58032 169.85385 122.81731
#[2,] 77.22445 76.85013 76.53198 76.03692 75.69819 75.25755
Whatever method it is, note how big the standard error is (compared with the estimate itself), and how slow it reduces.
It is much easier to use numerical integration (not surprising for integrating univariate functions):
integrate(f, 0, 200)
#11.99356 with absolute error < 0.0012
## trapezoidal rule
200 * mean.default(f(seq(0, 200, length = 10000)))
#[1] 11.99236
In the trapezoidal rule, even if only 1e+4 evenly spaced sampling points are taken, the integration is close enough to the truth.
Remark
Monte Carlo integration would have a less struggle if we do integration on a more restricted domain. From the figure of f
or g
, we see that this is an oscillating function. And actually, it crosses x-axis with a period of pi
. Let's consider an integration on [lower, upper]
.
MCI3 <- function (n, lower, upper) {
x <- runif(n, lower, upper)
y <- (upper - lower) * f(x)
c(mean.default(y), sqrt(var(y) / n))
}
a1 <- sapply(n, MCI3, lower = 0, upper = pi)
a2 <- sapply(n, MCI3, lower = pi, upper = 2 * pi)
a3 <- sapply(n, MCI3, lower = 2 * pi, upper = 3 * pi)
a4 <- sapply(n, MCI3, lower = 3 * pi, upper = 4 * pi)
a1[, (ncol(a1) - 5):ncol(a1)]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 17.04658711 16.97935808 17.01094302 17.02117843 16.96935285 16.99552898
#[2,] 0.02407643 0.02390894 0.02379678 0.02368683 0.02354298 0.02342799
a2[, (ncol(a2) - 5):ncol(a2)]
# [,1] [,2] [,3] [,4] [,5]
#[1,] -406.5646843 -404.9633321 -405.4300941 -405.4799659 -405.8337416
#[2,] 0.3476975 0.3463621 0.3442497 0.3425202 0.3409073
# [,6]
#[1,] -405.8628741
#[2,] 0.3390045
a3[, (ncol(a3) - 5):ncol(a3)]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 1591.539911 1592.280780 1594.307951 1591.375340 1593.171500 1591.648529
#[2,] 1.197469 1.190251 1.183095 1.177079 1.172049 1.165667
a4[, (ncol(a4) - 5):ncol(a4)]
# [,1] [,2] [,3] [,4] [,5]
#[1,] -3235.561677 -3239.147235 -3241.532097 -3238.421556 -3238.667702
#[2,] 2.336684 2.321283 2.311647 2.300856 2.286624
# [,6]
#[1,] -3237.043068
#[2,] 2.279032
Upvotes: 1
Reputation: 20080
If you're going to sample from exponential, it shouldn't be used again in the function.
From code
set.seed(32345)
func <- function(x) { (x^4 * sin(x)) }
n <- 10000000
x <- rexp(n, 0.2)
f <- func(x)
E <- mean(f)
I'm getting the answer
[1] 13.06643
UPDATE
It fluctuates, and fluctuates badly.
Lest first start with the right answer which according to Mathematica is equal to 4453125/371293 = 11.9936.
I transformed integral from
I = ∫ dx exp(-x/5) x4 sin(x)
using substitution y=x/5
to
I = 55 Γ(5) ∫ dy exp(-y) y5-1 / Γ(5) sin(5*y)
Everything but sin(5*y)
is normalized gamma distribution, which we will use to sample, and sin(5*y)
will be our function to compute mean value.
And used following trick together with large number of samples: I split calculation of positive values and negative values. It helps if you have fluctuating answer with values canceling each other. I did calculation in batches as well. Gamma function of 5 is just 4! (factorial)
Code
set.seed(32345)
N <- 10000000 # number of samples per batch
NN <- 640 # number of batches
pos <- rep(0, NN) # positive values
neg <- rep(0, NN) # negative values
for(k in 1:NN) { # loop over batches
y <- rgamma(N, shape=5, scale=1)
f <- sin(5.0 * y)
pnf <- ifelse(f > 0.0, f, 0.0)
pos[k] <- mean(pnf)
pnf <- ifelse(f < 0.0, -f, 0.0)
neg[k] <- mean(pnf)
print(k)
}
mean(pos)
sd(pos)/sqrt(NN)
mean(neg)
sd(neg)/sqrt(NN)
5*5*5*5*5*4*3*2*(mean(pos) - mean(neg))
Output
> mean(pos)
[1] 0.3183912
> sd(pos)/sqrt(NN)
[1] 4.749269e-06
>
> mean(neg)
[1] 0.3182223
> sd(neg)/sqrt(NN)
[1] 5.087734e-06
>
> 5*5*5*5*5*4*3*2*(mean(pos) - mean(neg))
[1] 12.67078
You could see that we really compute difference of two very close values, this is why it is hard to get convergence. It took a bit over 20 minutes to compute on my Xeon workstation.
And with different seed=12345
> mean(pos)
[1] 0.3183917
> sd(pos)/sqrt(NN)
[1] 4.835424e-06
>
> mean(neg)
[1] 0.3182268
> sd(neg)/sqrt(NN)
[1] 4.633129e-06
>
> 5*5*5*5*5*4*3*2*(mean(pos) - mean(neg))
[1] 12.36735
Upvotes: 2