Matt Mulvahill

Reputation: 23

dplyr::mutate comparing each value to vector, collapsing with any/all

I have a dataset of true values (location) that I'm trying to compare to a vector of estimated values using dplyr. My code below produces a warning and returns the same value in every row. How do I compare each value of data$location to every value of est.locations and collapse the result to TRUE when all of the absolute differences are greater than 20?

library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>%   sort)
est.locations <- runif(12, 0, 1500) %>% sort

data %>% 
  mutate(false.neg = (all(abs(location - est.locations) > 20)))

   num  location false.neg
1    1  453.4281     FALSE
2    2  454.4260     FALSE
3    3  718.0420     FALSE
4    4  801.2217     FALSE
5    5  802.7981     FALSE
6    6  854.2148     FALSE
7    7  873.6085     FALSE
8    8  901.0217     FALSE
9    9 1032.8321     FALSE
10  10 1240.3547     FALSE
Warning message:
In c(...) :
  longer object length is not a multiple of shorter object length

The context of the question is dplyr, but I'm open to other suggestions that may be faster. This is a piece of a larger calculation I'm doing on birth-death MCMC chains for 3000 iterations * 200 datasets (i.e. it is repeated many times, and the number of locations differs among datasets and from one iteration to the next).

UPDATE (10/13/15):

I'm going to mark akrun's solution as the answer. A linear algebra approach is a natural fit for this problem and with a little tweaking this will work for calculating both FNR and FPR (FNR should need an (l)apply by iteration, FPR should be one large vector/matrix operation).

JohannesNE's solution points out the issue with my initial approach: all() collapses the whole column to a single value, whereas I intended the operation to run row-wise. That also leads me to think there is likely a dplyr solution using rowwise() and do().
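A minimal sketch of the row-wise idea, assuming a dplyr version in which rowwise() followed by mutate() evaluates the expression one row at a time (do() is not needed for this case). The three true locations and four estimated locations are made-up values chosen so the result can be checked by hand; they are not from my data.

```r
library(dplyr)

# Illustrative inputs (assumed values, not the real data)
data <- data.frame(num = 1:3, location = c(100, 500, 900))
est.locations <- c(90, 300, 700, 1200)

# rowwise() makes `location` a single value inside mutate(), so all()
# is evaluated once per row against the full est.locations vector
result <- data %>%
  rowwise() %>%
  mutate(false.neg = all(abs(location - est.locations) > 20)) %>%
  ungroup()

print(result)
```

Here only the first row has an estimate within 20 (100 vs. 90), so it is the only row where false.neg is FALSE. I would expect this to be slower than a vectorised approach on large chains, since it evaluates the expression row by row.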

I attempted to limit the scope of the question in my initial post. For added context, the full problem concerns a Bayesian mixture model with an unknown number of components, where the components are defined by a 1D point process. Estimation results in a 'random effects' chain similar in structure to the version of est.locations below. The length mismatch is a result of having to estimate the number of components.

## Clarification of problem
options("max.print" = 100)
set.seed(1)

# True values (number of items and their location)
true.locations <- 
  data.frame("num"      = 1:10, 
             "location" = runif(10, 0, 1500) %>% sort)

# Mcmc chain of item-specific values ('random effects')
iteration <- 0  # counter, incremented from inside the function via <<-
est.locations <- 
  lapply(sample(10:14, 3000, replace = TRUE), function(x) {
      iteration  <<- iteration + 1
      total.items <- rep(x, x)
      num         <- 1:x
      location    <- runif(x, 0, 1500) %>% sort
      data.frame(iteration, total.items, num, location)
    }) %>% do.call(rbind, .) 
print(est.locations)

      iteration total.items num      location
1             1          11   1   53.92243818
2             1          11   2  122.43662006
3             1          11   3  203.87297671
4             1          11   4  641.70211495
5             1          11   5  688.19477968
6             1          11   6 1055.40283048
7             1          11   7 1096.11595818
8             1          11   8 1210.26744065
9             1          11   9 1220.61185888
10            1          11  10 1362.16553219
11            1          11  11 1399.02227302
12            2          10   1  160.55916378
13            2          10   2  169.66834129
14            2          10   3  212.44257723
15            2          10   4  228.42561489
16            2          10   5  429.22830291
17            2          10   6  540.42659572
18            2          10   7  594.58339156
19            2          10   8  610.53964624
20            2          10   9  741.62600969
21            2          10  10  871.51458277
22            3          13   1   10.88957267
23            3          13   2   42.66629869
24            3          13   3  421.77297967
25            3          13   4  429.95036650
 [ reached getOption("max.print") -- omitted 35847 rows ]
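For a chain shaped like the one above, the per-iteration calculation could be sketched as follows. This is only an illustration under assumptions: the chain here is a small 5-iteration stand-in built without dplyr, is.missed() is a hypothetical helper name, and FNR is taken to mean the share of true locations with no estimated location within 20 in a given iteration.

```r
set.seed(1)

# True values, same shape as true.locations above
true.locations <- data.frame(num = 1:10,
                             location = sort(runif(10, 0, 1500)))

# Small stand-in for the chain: 5 iterations with 10 to 14 items each
est.locations <- do.call(rbind, lapply(1:5, function(i) {
  n <- sample(10:14, 1)
  data.frame(iteration = i, total.items = n, num = seq_len(n),
             location = sort(runif(n, 0, 1500)))
}))

# Hypothetical helper: TRUE for each true location that has no
# estimate within `tol` of it
is.missed <- function(truth, est, tol = 20) {
  rowSums(abs(outer(truth, est, FUN = "-")) <= tol) == 0
}

# Split the estimated locations by iteration and compute the assumed FNR
fnr <- vapply(split(est.locations$location, est.locations$iteration),
              function(est) mean(is.missed(true.locations$location, est)),
              numeric(1))

print(fnr)
```

split() handles the varying number of items per iteration, so the length mismatch between iterations never reaches outer().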

Upvotes: 1

Views: 913

Answers (2)

akrun

Reputation: 887223

We can use outer for this kind of comparison. It gives every pairwise difference between 'location' and 'est.locations'. We take the abs, compare with 20, negate (!), take the rowSums, and negate again, so that a row is TRUE only if all of its elements are greater than 20.

data$false.neg <- !rowSums(!(abs(outer(data$location, est.locations, FUN = '-')) > 20))
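The same steps can be written out one at a time on a small example. This is a sketch with made-up inputs (three locations, four estimates), and the intermediate names d and near are only for illustration; it also checks the result against an element-by-element sapply.

```r
# Illustrative inputs (assumed values)
location <- c(100, 500, 900)
est.locations <- c(90, 300, 700, 1200)

# 3 x 4 matrix of absolute pairwise differences:
# rows are true locations, columns are estimated locations
d <- abs(outer(location, est.locations, FUN = "-"))

# TRUE where an estimate lies within 20 of a true location
near <- d <= 20

# A true location is a false negative when no estimate is near it
false.neg <- rowSums(near) == 0

# Cross-check against an element-by-element computation
by.sapply <- sapply(location, function(x) all(abs(x - est.locations) > 20))
stopifnot(identical(false.neg, by.sapply))

print(false.neg)
```

Writing `rowSums(near) == 0` is equivalent to the double negation in the one-liner, since `!` on a count is TRUE only when the count is zero.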

Upvotes: 0

JohannesNE

Reputation: 1363

You can use sapply to run the comparison once per value of location (here inside mutate, although mutate does little more than attach the resulting column).

library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>%   sort)
est.locations <- runif(12, 0, 1500) %>% sort

data %>% 
    mutate(false.neg = sapply(location, function(x) {
        all(abs(x - est.locations) > 20)
    }))

   num   location false.neg
1    1   92.67941      TRUE
2    2  302.52290     FALSE
3    3  398.26299      TRUE
4    4  558.18585     FALSE
5    5  859.28005      TRUE
6    6  943.67107      TRUE
7    7  991.19669      TRUE
8    8 1347.58453      TRUE
9    9 1362.31168      TRUE
10  10 1417.01290     FALSE

Upvotes: 2
