elJorge

Reputation: 163

Select names of columns which contain specific values in row

I'm using a data.frame:

        x <- data.frame("A"=c(NA,5,NA,NA,NA),
                        "B"=c(1,2,3,4,NA),
                        "C"=c(NA,NA,NA,2,3),
                        "D"=c(NA,NA,NA,7,NA))

This delivers a data.frame in this form:

   A  B  C  D
1 NA  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  4  2  7
5 NA NA  3 NA

My aim is to check each row of the data.frame for values greater than a given threshold (let's assume 2) and to get the names of the columns where this is the case.

The desired output (values greater than 2) should be:

for row 1 of the data.frame
x[1,]: c()

for row 2
x[2,]: c("A")

for row 3
x[3,]: c("B")

for row 4
x[4,]: c("B","D")

and for row 5 of the data.frame
x[5,]: c("C")

Thanks for your help!

Upvotes: 6

Views: 6636

Answers (3)

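Why not use

colnames(df)[which(df[i, ] > 2)]

for each row, where df is your data frame and i is the row number? Wrapping the comparison in which() drops the NA results, which would otherwise make the column subset fail ;)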
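
A minimal sketch of applying that per-row idea to every row at once, assuming the question's data is stored in `df` (the name `cols_gt2` is chosen here purely for illustration). Indexing `names(df)` with `which()` skips the `NA` comparisons and keeps single-column matches from collapsing:

```r
# The question's example data; `df` and `cols_gt2` are illustrative names.
df <- data.frame(A = c(NA, 5, NA, NA, NA),
                 B = c(1, 2, 3, 4, NA),
                 C = c(NA, NA, NA, 2, 3),
                 D = c(NA, NA, NA, 7, NA))

# For each row index, keep the names of the columns whose value exceeds 2.
# which() ignores NA comparisons, and lapply() always returns a list.
cols_gt2 <- lapply(seq_len(nrow(df)),
                   function(i) names(df)[which(df[i, ] > 2)])

print(cols_gt2)
```

Rows without any match come back as `character(0)`, so the result always has one element per row.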

Upvotes: 1

Arun

Reputation: 118779

To answer @flodel's concerns, I'll write it as a separate answer:

1) Using lapply gets a list and apply doesn't guarantee this always:

A fair point. I'll illustrate the issue with an example:

df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA), 
    C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A", 
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")

   A  B  C  D
1  3  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  1  2  7
5 NA NA  3 NA

# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"

So, how can we guarantee a list with apply?

By creating a list inside the function and then using unlist with recursive = FALSE, as shown below:

unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"

[[2]]
[1] "A"

[[3]]
[1] "B"

[[4]]
[1] "D"

[[5]]
[1] "C"

2) lapply is overall shorter, and does not require an anonymous function:

Yes, but it's slower. Let me illustrate this on a big example.

set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE), 
               ncol = 100))

system.time(t1 <- lapply(apply(df > 2, 1, which), names))
   user  system elapsed 
  5.025   0.342   5.651 

system.time(t2 <- unlist(apply(df, 1, function(x) 
            list(names(which(x>2)))), recursive=FALSE))
   user  system elapsed 
  2.860   0.181   3.065 

identical(t1, t2) # TRUE

3) All answers are wrong and the answer that'll work with all inputs:

lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])

First, I don't see what's wrong. If you mean that the list is unnamed, that can be fixed by setting the names just once at the end.

Second, unfortunately, using split on a huge data.frame produces a very large number of split elements and is terribly slow (due to the huge number of factor levels).

# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
   user  system elapsed
517.545   0.312 517.872

Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5. Instead, one could use setNames, or setattr from the data.table package, to set the names just once at the end, as shown below:

# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy

# or even better using `data.table` `setattr` function to 
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
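
The ordering issue mentioned in the third point can be seen on a tiny data.frame. This is only an illustrative sketch; `d`, `by_name` and `by_index` are made-up names. Splitting by the character row names sorts the groups as text, while splitting by an integer index keeps the numeric row order:

```r
# A ten-row data.frame with automatic row names "1" .. "10".
d <- data.frame(v = 1:10)

# Character row names become factor levels sorted as text,
# so "10" typically lands before "2".
by_name <- split(d, rownames(d))

# An integer index becomes a factor with numerically sorted levels.
by_index <- split(d, seq_len(nrow(d)))

names(by_name)
names(by_index)
```

This does not address the speed problem of split on many groups; it only shows where the reordering comes from.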

Comparing the output doesn't show any other difference between the two (t3 and t2). You could run this to verify that the outputs are the same (time consuming):

all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE

Upvotes: 3

user1981275

Reputation: 13372

You can use which:

lapply(apply(dat, 1, function(x)which(x>2)), names)

with dat being your data frame.

[[1]]
character(0)

[[2]]
[1] "A"

[[3]]
[1] "B"

[[4]]
[1] "B" "D"

[[5]]
[1] "C"

EDIT Shorter version suggested by flodel:

lapply(apply(dat > 2, 1, which), names)

Edit: (from Arun)

First, there's no need for lapply and apply. You can get the same just with apply:

apply(dat > 2, 1, function(x) names(which(x)))

But, using apply on a data.frame will coerce it into a matrix, which may not be wise if the data.frame is huge.
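
One possible way to skip the row-wise function calls altogether is sketched below. It is not part of the original answers, and the names `idx`, `row_f` and `res` are illustrative. Note that `dat > 2` still builds a logical matrix, so this avoids the per-row loop rather than the matrix itself:

```r
# The question's example data.
dat <- data.frame(A = c(NA, 5, NA, NA, NA), B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),  D = c(NA, NA, NA, 7, NA))

# Row/column positions of all values greater than 2 (NA results are skipped).
idx <- which(dat > 2, arr.ind = TRUE)

# Group the matching column names by row; spelling out all row levels keeps
# rows without any match as character(0).
row_f <- factor(idx[, 1], levels = seq_len(nrow(dat)))
res   <- split(names(dat)[idx[, 2]], row_f)

print(res)
```

The result is a list named by row number, with the column names of each row in column order.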

Upvotes: 6
