Reputation: 406
I have the following function:
func <- function(scores, labels, thresholds) {
labels <- if (is.data.frame(labels)) labels else data.frame(labels)
sapply(thresholds, function(t) { sapply(labels, function(lbl) { sum(lbl[which(scores >= t)]) }) })
}
I also have the following that I'll pass into func.
> scores
[1] 0.187 0.975 0.566 0.793 0.524 0.481 0.005 0.756 0.062 0.124
> thresholds
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> var1
[1] 1 1 0 0 0 1 0 1 1 1
> df
var1 var2
1 1 0
2 1 1
3 0 0
4 0 0
5 0 0
6 1 1
7 0 1
8 1 1
9 1 1
10 1 0
Here are two different calls to func, one with labels as a vector and the other with labels as a data.frame:
> func(scores, var1, thresholds)
labels labels labels labels labels labels labels labels labels labels labels
6 5 3 3 3 2 2 2 1 1 0
> func(scores, df, thresholds)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
var1 6 5 3 3 3 2 2 2 1 1 0
var2 5 3 3 3 3 2 2 2 1 1 0
Why does "labels" get applied as a colname in the vector version, and "var1" and "var2" get applied as a rowname in the data.frame version?
What I'm looking for is the vector version to be more like:
> func(scores, var1, thresholds)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
labels 6 5 3 3 3 2 2 2 1 1 0
To create the variables above:
scores <- sample(seq(0, 1, 0.001), 10, replace = TRUE)
thresholds <- seq(0, 1, 0.1)
var1 <- sample(c(0, 1), 10, replace = TRUE)
var2 <- sample(c(0, 1), 10, replace = TRUE)
df <- data.frame(var1, var2)
Upvotes: 3
Views: 957
Reputation: 10437
Note: @weihuang-wong's answer is great, and its solution is in some ways better than this one. I had already written most of this answer before that one was posted, so I decided to post it anyway.
You get the names you do because those are the names of the things you iterate over. But why do you get a named vector in the first case and a matrix with rownames in the second case? Here is a simpler case that makes it easier to see.
sapply(1, function(x) sapply(c(a = 1), function(y) y))
# a
# 1
sapply(1, function(x) sapply(c(a = 1, b = 2), function(y) y))
# [,1]
# a 1
# b 2
OK, so what is happening here? Let's break it down so we can see.
sapply(c(a = 1), function(y) y) returns a named length-one vector.
sapply(c(a = 1, b = 2), function(y) y) returns a named length-two vector.
Now it's the job of the outer sapply to combine those results. When every inner sapply result is a length-one vector, the outer sapply simplifies them to a single named vector. That simplification doesn't work when each result has length > 1, so sapply simplifies to a matrix instead, with one column per result.
So if we want consistency we need sapply to return a matrix, even in the length-one case. How do we make sapply consistent? It's surprisingly difficult. In the end I would just convert it to a matrix after the fact.
matrix(sapply(1, function(x) sapply(c(a = 1), function(y) y)), dimnames = list("a"))
# [,1]
# a 1
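As a possible alternative to converting after the fact, and only as a sketch: collecting the inner results with lapply and binding them with cbind should give a matrix whatever the inner length is, because cbind treats each vector as one column and takes the row names from the element names. The helper name bind_cols and its arguments are made up for illustration and are not part of the original question.

```r
# Hypothetical helper (illustration only): run the inner sapply for each x,
# keep the results as a list, then bind them column-wise.
bind_cols <- function(xs, inner) {
  do.call(cbind, lapply(xs, function(x) sapply(inner, function(y) y * x)))
}

one <- bind_cols(1:3, c(a = 1))         # inner results have length one
two <- bind_cols(1:3, c(a = 1, b = 2))  # inner results have length two

dim(one)       # one row, three columns
dim(two)       # two rows, three columns
rownames(one)  # the inner name is kept as a row name
```

The trade-off is that the simplification is now done explicitly, so the result shape no longer depends on how many elements the inner call returns.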
Now that we understand what's happening we can apply what we've learned to the original problem.
func <- function(scores, labels, thresholds) {
labels <- if (is.data.frame(labels)) labels else data.frame(labels)
r <- sapply(thresholds, function(t) { sapply(labels, function(lbl) { sum(lbl[which(scores >= t)]) }) })
if(!is.matrix(r)) r <- matrix(r, nrow = 1, dimnames = list(names(labels)))
r
}
func(scores, df, thresholds)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# var1 6 5 3 3 3 2 2 2 1 1 0
# var2 5 3 3 3 3 2 2 2 1 1 0
func(scores, var1, thresholds)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# labels 6 5 3 3 3 2 2 2 1 1 0
Upvotes: 4
Reputation: 13118
Try switching the order of the nested sapply calls:
func <- function(scores, labels, thresholds) {
labels <- if (is.data.frame(labels)) labels else data.frame(labels)
t(sapply(labels, function(lbl) {
sapply(thresholds, function(t) sum(lbl[which(scores >= t)]))
}))
}
From ?sapply:
‘sapply’ is a user-friendly version and wrapper of ‘lapply’ by default returning a vector, matrix or, if ‘simplify = "array"’, an array if appropriate, by applying ‘simplify2array()’.
To understand what's going on in your original function, it's perhaps useful to think about each sapply in turn.
The inner sapply(labels, ...) creates a named vector of length k, where k is the number of columns in labels (so k is 1 in the vector case, and 2 in the dataframe example). The names of the vector elements are given by the column names (labels in the vector case, and var1/var2 in the dataframe example).
The outer sapply(thresholds, ...) runs the inner sapply 11 times, each time with a different value of t. So in the vector case, you'll end up with 11 vectors of length 1 where the name of the one and only element in each vector is labels, which sapply returns ("simplifies") as one vector of length 11.
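To make the two steps visible, here is a small sketch that keeps the per-threshold results in a list with lapply and then applies simplify2array() by hand, which is what the ?sapply excerpt says sapply does. The data are the values printed in the question; the helper name inner is an assumption for illustration.

```r
scores <- c(0.187, 0.975, 0.566, 0.793, 0.524, 0.481, 0.005, 0.756, 0.062, 0.124)
var1 <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 1)
var2 <- c(0, 1, 0, 0, 0, 1, 1, 1, 1, 0)
thresholds <- seq(0, 1, 0.1)

# Hypothetical helper: the inner sapply from the original func, for one threshold.
inner <- function(labels, t) sapply(labels, function(lbl) sum(lbl[scores >= t]))

# What the outer sapply has in hand before simplifying.
one_col <- lapply(thresholds, function(t) inner(data.frame(labels = var1), t))
two_col <- lapply(thresholds, function(t) inner(data.frame(var1, var2), t))

lengths(one_col)  # every element has length 1
lengths(two_col)  # every element has length 2

v <- simplify2array(one_col)  # collapses to a plain named vector
m <- simplify2array(two_col)  # becomes a matrix with rows var1 and var2
```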
By switching the order of the sapply calls, the inner sapply now returns an unnamed vector of length 11. The outer sapply then does this k times. In the vector case, k is 1, and the name of the vector returned is labels. In the dataframe example, k is 2, and the names of the 2 vectors returned are var1 and var2.
(It might also be a useful exercise to name the elements in the thresholds vector, e.g. thresholds <- setNames(seq(0, 1, 0.1), LETTERS[1:11]), and re-run func to see what happens.)
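For anyone who wants to try that exercise with the reordered func, a sketch using the values printed in the question might look like this. With named thresholds, the inner sapply should carry those names through, so they end up as the column names after t().

```r
scores <- c(0.187, 0.975, 0.566, 0.793, 0.524, 0.481, 0.005, 0.756, 0.062, 0.124)
var1 <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 1)
thresholds <- setNames(seq(0, 1, 0.1), LETTERS[1:11])

# The reordered func from this answer: outer loop over label columns,
# inner loop over thresholds, then transpose.
func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  t(sapply(labels, function(lbl) {
    sapply(thresholds, function(t) sum(lbl[which(scores >= t)]))
  }))
}

res <- func(scores, var1, thresholds)
dimnames(res)  # row name from the label column, column names from thresholds
```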
Upvotes: 4