Reputation: 163
I'm using a data.frame:
x <- data.frame("A"=c(NA,5,NA,NA,NA),
"B"=c(1,2,3,4,NA),
"C"=c(NA,NA,NA,2,3),
"D"=c(NA,NA,NA,7,NA))
This delivers a data.frame in this form:
A B C D
1 NA 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 4 2 7
5 NA NA 3 NA
My aim is to check, for each row of the data.frame, whether it contains a value greater than a given threshold (let's assume 2), and to get the names of the columns where this is the case.
The desired output (values greater than 2) should be:
for row 1 of the data.frame
x[1,]: c()
for row 2
x[2,]: c("A")
for row 3
x[3,]: c("B")
for row 4
x[4,]: c("B","D")
and for row5 of the data.frame
x[5,]: c("C")
Thanks for your help!
Upvotes: 6
Views: 6636
Reputation: 51
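Why not do
names(df)[which(df[i,] > 2)]
for each row, where df is your data frame and i is the row number? Wrapping the comparison in which() drops the NA results, which would otherwise break the column selection. ;)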
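Applied to every row, the idea might look like the following sketch. The helper name cols_above and its threshold argument are made up for illustration; which() is used so that comparisons against NA are dropped rather than passed on as an index.

```r
# The data frame from the question
df <- data.frame(A = c(NA, 5, NA, NA, NA),
                 B = c(1, 2, 3, 4, NA),
                 C = c(NA, NA, NA, 2, 3),
                 D = c(NA, NA, NA, 7, NA))

# Hypothetical helper: names of the columns whose value in row i exceeds threshold
cols_above <- function(df, i, threshold = 2) {
  # which() keeps only the TRUE positions, so NA comparisons are ignored
  names(df)[which(df[i, ] > threshold)]
}

# One character vector per row, collected in a list
res <- lapply(seq_len(nrow(df)), cols_above, df = df)
res[[4]]  # "B" "D"
```

Looping over rows like this is easy to read but slow on large data frames, so it is best suited to small inputs.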
Upvotes: 1
Reputation: 118779
To answer @flodel's concerns, I'll write it as a separate answer:
"lapply gets a list, and apply doesn't always guarantee this":
A fair point. I'll illustrate the issue with an example:
df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A",
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")
A B C D
1 3 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 1 2 7
5 NA NA 3 NA
# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"
So, how can we guarantee a list with apply? By creating a list within the function argument and then using unlist with recursive = FALSE, as shown below:
unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "D"
[[5]]
[1] "C"
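To make the difference concrete, here is a small sketch that runs both variants on the question's original data and on the modified data above. The names dat_q, dat_a, plain and wrapped are only illustrative. Plain apply returns a list only when the per-row results differ in length, whereas the wrapped version returns a list in both cases.

```r
# Question's data: per-row results have different lengths
dat_q <- data.frame(A = c(NA, 5, NA, NA, NA), B = c(1, 2, 3, 4, NA),
                    C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA))
# Modified data: every row yields exactly one column name
dat_a <- data.frame(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
                    C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA))

plain   <- function(x) names(which(x > 2))
wrapped <- function(x) list(names(which(x > 2)))

r_q <- apply(dat_q, 1, plain)  # a list, because the lengths differ
r_a <- apply(dat_a, 1, plain)  # simplified to a character vector

# Wrapping each result in a list prevents the simplification
w_q <- unlist(apply(dat_q, 1, wrapped), recursive = FALSE)
w_a <- unlist(apply(dat_a, 1, wrapped), recursive = FALSE)

c(is.list(r_q), is.list(r_a), is.list(w_q), is.list(w_a))
```

The second element of that last vector is the one to watch: it is FALSE for plain apply, while both wrapped results are lists.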
"lapply is overall shorter, and does not require an anonymous function":
Yes, but it's slower. Let me illustrate this on a big example.
set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE),
ncol = 100))
system.time(t1 <- lapply(apply(df > 2, 1, which), names))
user system elapsed
5.025 0.342 5.651
system.time(t2 <- unlist(apply(df, 1, function(x)
list(names(which(x>2)))), recursive=FALSE))
user system elapsed
2.860 0.181 3.065
identical(t1, t2) # TRUE
lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])
First, I don't see what's wrong. If you're talking about the list being unnamed, that can be fixed by setting the names just once at the end.
Second, unfortunately, using split on a huge data.frame that results in very many split elements will be terribly slow (due to the huge number of factor levels).
# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
user system elapsed
517.545 0.312 517.872
Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5. Instead, one could just use setNames or setattr (from the data.table package) to set the names once at the end, as shown below:
# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy
# or even better using `data.table` `setattr` function to
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
Comparing the output doesn't show any other difference between the two (t3 and t2). You could run this to verify that the outputs are the same (time-consuming):
all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE
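The ordering issue is easy to reproduce on a small scale. The sketch below uses a made-up 12-row data frame (small, by_split and by_apply are illustrative names): split turns the row names into a factor, so the list comes back in character-sorted order, while naming the apply result once at the end keeps the row order.

```r
# Small made-up data frame with automatic row names "1" .. "12"
small <- data.frame(A = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8),
                    B = c(2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5))

# split() sorts the row names as character strings
by_split <- lapply(split(small, rownames(small)),
                   function(x) names(x)[which(x > 2)])
names(by_split)  # starts "1" "10" "11" "12" "2", not "1" "2" "3"

# apply keeps the row order; the names are set once at the end
by_apply <- unlist(apply(small, 1, function(x) list(names(which(x > 2)))),
                   recursive = FALSE)
by_apply <- setNames(by_apply, rownames(small))

# Same content once the split result is put back into row order
identical(by_split[rownames(small)], by_apply)
```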
Upvotes: 3
Reputation: 13372
You can use which:
lapply(apply(dat, 1, function(x)which(x>2)), names)
with dat being your data frame.
[[1]]
character(0)
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "B" "D"
[[5]]
[1] "C"
Edit: Shorter version suggested by flodel:
lapply(apply(dat > 2, 1, which), names)
Edit: (from Arun)
First, there's no need for both lapply and apply. You can get the same result with apply alone:
apply(dat > 2, 1, function(x) names(which(x)))
But using apply on a data.frame coerces it to a matrix, which may not be wise if the data.frame is huge.
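As one possible alternative, the row-wise function call can be avoided altogether by locating all matches at once with which(..., arr.ind = TRUE) and then grouping the column names by row. This is only a sketch (hits and res are illustrative names), and note that dat > 2 still builds a logical matrix the size of the data, so it does not remove the memory cost entirely.

```r
# The data frame from the question
dat <- data.frame(A = c(NA, 5, NA, NA, NA),
                  B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),
                  D = c(NA, NA, NA, 7, NA))

# Row/column positions of every value greater than 2 (NA results are skipped)
hits <- which(dat > 2, arr.ind = TRUE)

# Group the matching column names by row; giving the factor all row numbers
# as levels keeps rows without any match as character(0)
res <- split(names(dat)[hits[, "col"]],
             factor(hits[, "row"], levels = seq_len(nrow(dat))))
res[["4"]]  # "B" "D"
```

The result is a list named by row number, with one character vector per row.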
Upvotes: 6