Reputation: 1112
I am trying to calculate chi-square discrepancies for observed and simulated data in order to evaluate model fit using Bayesian inference. The observed dataset contains missing (NA) values, but the simulated one has none, so I cannot directly compare the discrepancy statistics between them.
The code below is a simplified example similar to my actual work:
p <- array(runif(3000*195*6, 0, 1), c(3000, 195, 6))   # success probabilities, one slice per iteration
N <- array(rpois(3000*195, 10), c(3000, 195))          # binomial sample sizes, one row per iteration
y <- array(0, c(195, 6))                               # observed data, generated from iteration 1
for(j in 1:195){
  for(k in 1:6){
    y[j,k] <- rbinom(1, N[1,j], p[1,j,k])  # N[1,j], not N[j]: N is a matrix, so N[j] silently means N[j,1]
  }
}
# knock out 50 cells at random; sample() gives valid integer indices,
# whereas runif() returns non-integer values that get truncated when used as indices
foo <- sample(1:195, 50, replace = TRUE)
bar <- sample(1:6, 50, replace = TRUE)
for(i in 1:50){
  y[foo[i], bar[i]] <- NA
}
This code produces the response variable y, including some missing (NA) values. I then calculate the chi-square discrepancy for the observed data y and for a simulated "ideal" dataset y.new. Unlike y, y.new has no missing values, so if I simply leave the missing cells out of the sum for y but not for y.new, fit.new will systematically be larger than fit.
eval  <- array(NA, c(3000, 195, 6))   # expected counts
E     <- array(NA, c(3000, 195, 6))   # discrepancy for the observed data
E.new <- array(NA, c(3000, 195, 6))   # discrepancy for the replicated data
y.new <- array(NA, c(3000, 195, 6))   # must be 3D: it is indexed as y.new[i,j,k] below
for(i in 1:3000){
  for(j in 1:195){
    for(k in 1:6){
      eval[i,j,k]  <- p[i,j,k]*N[i,j]
      E[i,j,k]     <- ((y[j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
      y.new[i,j,k] <- rbinom(1, N[i,j], p[i,j,k]) # create new "ideal" dataset
      E.new[i,j,k] <- ((y.new[i,j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
    }
  }
} # very slow! think about how to vectorize instead of nested for loops
fit <- sum(E)          # NA whenever y contains NAs
fit.new <- sum(E.new)
Now, my question is how to handle the missing values. As written, y[j,k] - eval[i,j,k] evaluates to NA wherever y is missing, so fit comes out as NA; and even if it didn't, fit and fit.new would not be comparable, because they would sum over different numbers of cells. My idea is to find the locations of the missing values in y and drop those same [j,k] cells from all the other arrays I am using. Any suggestions on how best to do this?
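For example, I can already locate the offending cells; what I don't see is a clean way to drop those same cells from E, E.new, and the rest:
na.cells <- which(is.na(y), arr.ind = TRUE)  # two-column matrix of the [j,k] positions of the NAs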
EDIT: I'm getting a very strange result. Whether I run the loop version above or a vectorized version (using sweep), E[1,,] is much smaller than E[i,,] for i > 1, even though eval[1,,] and eval[i,,] for i > 1 appear to be the same. I even tried replicating y[j,k] into a 3D y[i,j,k] with every y[i,,] slice equal, just to rule out the handling of different-sized matrices as the problem. Does anyone know why this would be the case? In theory, with this simulated data, I would expect all the slices E[i,,] and E.new[i,,] to be roughly similar.
Below is some summary output to show what I'm talking about. This seems like a new question, but it relates to my original one: I assumed the NAs were causing the problem, but that may not be the only thing going on.
> summary(eval[1,,])
V1 V2 V3 V4
Min. : 0.01167 Min. : 0.01476 Min. : 0.0293 Min. : 0.01953
1st Qu.: 2.60909 1st Qu.: 2.35093 1st Qu.: 2.5239 1st Qu.: 1.85789
Median : 4.85460 Median : 5.12719 Median : 5.2480 Median : 4.35639
Mean : 5.09371 Mean : 5.39451 Mean : 5.3891 Mean : 4.72061
3rd Qu.: 6.91273 3rd Qu.: 7.44676 3rd Qu.: 7.5431 3rd Qu.: 7.06119
Max. :15.81298 Max. :14.94309 Max. :14.9851 Max. :16.25751
> summary(eval[2,,])
V1 V2 V3 V4
Min. : 0.06346 Min. : 0.06468 Min. : 0.2092 Min. : 0.006769
1st Qu.: 2.44825 1st Qu.: 1.93702 1st Qu.: 2.4226 1st Qu.: 2.426689
Median : 4.16865 Median : 4.01536 Median : 5.0771 Median : 4.833679
Mean : 4.85646 Mean : 4.64887 Mean : 5.3450 Mean : 5.169656
3rd Qu.: 6.64691 3rd Qu.: 6.96278 3rd Qu.: 7.7034 3rd Qu.: 7.229125
Max. :13.00335 Max. :13.79093 Max. :17.2673 Max. :17.915080
> summary(E[1,,])
V1 V2 V3 V4
Min. :0.00001 Min. :0.00000 Min. :0.000003 Min. :0.000008
1st Qu.:0.02744 1st Qu.:0.02723 1st Qu.:0.023008 1st Qu.:0.035854
Median :0.11750 Median :0.11889 Median :0.109138 Median :0.146706
Mean :0.39880 Mean :0.41636 Mean :0.353876 Mean :0.479533
3rd Qu.:0.46435 3rd Qu.:0.40993 3rd Qu.:0.390625 3rd Qu.:0.604021
Max. :4.43466 Max. :4.83871 Max. :6.254577 Max. :5.231650
NA's :10 NA's :8 NA's :8 NA's :10
> summary(E[2,,])
V1 V2 V3
Min. : 0.0000 Min. : 0.00003 Min. : 0.00002
1st Qu.: 0.8213 1st Qu.: 0.42091 1st Qu.: 0.36853
Median : 2.0454 Median : 2.31697 Median : 2.39892
Mean : 8.0619 Mean : 9.40838 Mean : 6.38919
3rd Qu.: 5.6755 3rd Qu.: 6.34782 3rd Qu.: 4.89749
Max. :395.9499 Max. :172.83324 Max. :120.93648
NA's :10 NA's :8 NA's :8
Thanks, Dan
Upvotes: 2
Views: 6812
Reputation: 2151
You can add a test inside the inner loop and reorder the loops so that the NA check runs only once per [j,k] cell:
...
for(j in 1:195){
  for(k in 1:6){
    if( !is.na(y[j,k]) ){   # y[j,k], not y(j,k): parentheses would be a function call
      for(i in 1:3000){
        ...
      }
    }
  }
}
...
For more efficiency, vectorize the inner loops (as the comment in your own code already suggests).
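For instance, a minimal vectorized sketch, assuming the dimensions from your question (array() recycles N across the third dimension, and aperm() copies y into every iteration slice):
eval  <- p * array(N, dim(p))                           # eval[i,j,k] = p[i,j,k] * N[i,j]
yrep  <- aperm(array(y, c(195, 6, 3000)), c(3, 1, 2))   # yrep[i,j,k] = y[j,k]
E     <- (yrep - eval)^2 / (eval + 0.5)                 # NA wherever y is NA
y.new <- array(rbinom(length(p), size = array(N, dim(p)), prob = p), dim(p))
E.new <- (y.new - eval)^2 / (eval + 0.5)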
It is also possible to define a logical array with the same dimensions as y marking the observed positions, e.g. keep <- !is.na(y) (better not to call it subset, which masks base R's subset() function), and use it to restrict both discrepancy sums to the same cells.
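A sketch of that idea, keeping one fit statistic per posterior draw so that fit and fit.new sum over exactly the same cells (apply() with MARGIN = 1 hands each 195 x 6 slice to the function):
keep    <- !is.na(y)                                   # TRUE where y was observed
fit     <- apply(E,     1, function(m) sum(m[keep]))   # one discrepancy sum per draw
fit.new <- apply(E.new, 1, function(m) sum(m[keep]))   # same cells dropped on both sides
mean(fit.new > fit)                                    # Bayesian p-value; near 0.5 suggests adequate fit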
Upvotes: 4