nachocab

Reputation: 14454

How to work with the rows of a data frame without coercing it into a character vector?

I have this data frame:

df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d")
)
#   a b
# 1 0 a
# 2 1 b
# 3 0 c
# 4 1 d

Let's say I want to test each row for a condition and return either "ok" or "not ok". This should work:

apply(df, 1, function(row){
    if (is.numeric(row[1]) & row[2] != "b") {
        "ok"
    } else {
        "not ok"
    }
})
# It should return: "ok" "not ok" "ok" "ok"

Unfortunately, apply coerces the data frame to a matrix, which can hold only a single type, so every value is treated as a character. This is the output I get:

# "not ok" "not ok" "not ok" "not ok"
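
The coercion can be seen directly: apply() converts a data frame with as.matrix(), and a matrix holds a single type. A minimal sketch using the same df (stringsAsFactors = FALSE is set here only as an assumption, so the example behaves the same across R versions):

```r
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# apply() works on a matrix; mixing numeric and character columns
# forces every cell to character
m <- as.matrix(df)
typeof(m)

# each row passed to the function is therefore a character vector,
# which is why is.numeric(row[1]) is FALSE for every row
row_types <- apply(df, 1, function(row) class(row))
row_types
```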

Is there a way to go through the rows of a dataframe preserving the data types? Maybe using dplyr::do or purrr::map?

Update

I know the conditions in the example don't make a lot of sense, but I was trying to simplify a more complex condition. I want to avoid using nested ifelse statements because they are not very readable.

Upvotes: 2

Views: 447

Answers (2)

tospig

Reputation: 8343

The first half of this answer expands on and tries to explain @Joran's excellent comment/answer. It is mainly an exercise for my own understanding, but hopefully it helps someone else too (and I'm happy to have my understanding corrected).

The second half shows a couple of other non-base solutions that could be used in more complex situations.

Joran's answer

c('not ok','ok')[(is.numeric(df[[1]]) & (df[[2]] != 'b')) + 1]

From ?data.frame

A data frame is a list of variables

so each column/variable in the data.frame is an element of that list

From ?[ and this question on the difference between [ and [[ we note that

For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.

Therefore, using [[ in this solution selects a single element of the list

df[[1]]    ## select the 1st column as a single element (which is a vector)
# [1] 0 1 0 1
df[[2]]    ## select the 2nd column as a single element (which is a vector)
# [1] a b c d 

## note that df[1] would return the first column as a data.frame (which is a list), not a vector
## we can see that by 
# > str(df[1])
# 'data.frame': 4 obs. of  1 variable:
#   $ a: num  0 1 0 1
# > str(df[[1]])
# num [1:4] 0 1 0 1

With these two vectors now selected we can perform the vectorised logical check on each element within them

is.numeric(df[[1]]) & (df[[2]] != 'b')
# TRUE FALSE TRUE TRUE
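
One detail worth spelling out: is.numeric() tests the column as a whole and returns a single TRUE, while != compares element by element, and & then recycles the single value across the four comparisons. A small sketch, assuming the same df as in the question (the variable names are illustrative only):

```r
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

col_is_numeric <- is.numeric(df[[1]])  # one value for the whole column
not_b <- df[[2]] != "b"                # one value per row

length(col_is_numeric)
length(not_b)

# the length-1 TRUE is recycled against the length-4 vector
res <- col_is_numeric & not_b
res
```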

From ?logical we have

...with TRUE being mapped to 1L, FALSE to 0L...

so essentially TRUE == 1L and FALSE == 0L, which we can see by

sum(c(TRUE, TRUE, FALSE, TRUE))
# [1] 3

Now, taking a vector of our choices

c("not ok", "ok")
# [1] "not ok" "ok"

we can use [ again to select each element

c("not ok", "ok")[1]
# [1] "not ok"
c("not ok", "ok")[2]
# [1] "ok"
c("not ok", "ok")[3]
# [1] NA
## Because there isn't a 3rd element
c("not ok", "ok")[0]
# character(0)    ## empty
## and we can use a vector to select each element
c("not ok", "ok")[c(1,2,1,3)]
# [1] "not ok" "ok"     "not ok" NA 

Which also means we can use our logical comparison from earlier to subset the choices. However, as FALSE is mapped to 0L, we need to add 1 to it so it will be able to select from the vector

c(TRUE, FALSE, TRUE, TRUE) + 1
# [1] 2 1 2 2

which gives

c("not ok", "ok")[c(2,1,2,2)]
# [1] "ok"     "not ok" "ok"     "ok"

Which now gives us the information we want to include in our original data.frame

df$c <- c("not ok", "ok")[c(2,1,2,2)]
#   a b      c
# 1 0 a     ok
# 2 1 b not ok
# 3 0 c     ok
# 4 1 d     ok
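
The same steps can be run end to end by computing the index from df itself rather than typing it in. This is a minimal sketch; the names labels and keep are made up for illustration:

```r
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

labels <- c("not ok", "ok")

# logical vector, one entry per row
keep <- is.numeric(df[[1]]) & (df[[2]] != "b")

# FALSE + 1 = 1 selects "not ok", TRUE + 1 = 2 selects "ok"
df$c <- labels[keep + 1]
df$c
```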

Non-base solutions

## a dplyr version, still using ifelse construct
library(dplyr)
df %>%
  mutate(c = ifelse(is.numeric(a) & b != "b", "ok", "not ok")) 

## a couple of data.table versions using by-reference updates (:=)
library(data.table)
## using an ifelse
setDT(df)[, c := ifelse(is.numeric(a) & b != "b", "ok", "not ok")]

## using filters in i
setDT(df)[is.numeric(a) & b != "b", c := "ok"][is.na(c), c := "not ok"]

Upvotes: 1

Stibu

Reputation: 15927

A solution with ifelse() has been suggested in the comments and this is of course fine in your case:

df$c <- ifelse(is.numeric(df$a) & df$b != "b", "ok", "not ok")
df
##   a b      c
## 1 0 a     ok
## 2 1 b not ok
## 3 0 c     ok
## 4 1 d     ok

But your more general question is how to apply a function over the rows of a data frame without converting it to a matrix. One way to do this is to use vapply (or another function of the apply family) over the row indices:

df$c <- vapply(seq_len(nrow(df)), function(i) {
             # df[i, 1] and df[i, 2] keep the types of their columns
             if (is.numeric(df[i, 1]) && df[i, 2] != "b") {
               "ok"
             } else {
               "not ok"
             }
           }, character(1))
df
##   a b      c
## 1 0 a     ok
## 2 1 b not ok
## 3 0 c     ok
## 4 1 d     ok

Again, in your situation, ifelse() is just fine. But if you want to do something more complicated with the rows of your data frame, applying over row indices might be the way to go.
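
Another base R option, offered as a sketch rather than as part of the answer above, is to walk the columns in parallel with mapply(), so that each value arrives with its column's type. The helper name check_row is hypothetical, and stringsAsFactors = FALSE is an assumption made so that b is a plain character column:

```r
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# hypothetical helper: receives one value from each column per call
check_row <- function(a, b) {
  if (is.numeric(a) && b != "b") "ok" else "not ok"
}

# iterate over both columns in parallel; no matrix conversion happens
res <- mapply(check_row, df$a, df$b, USE.NAMES = FALSE)
res
```

Because the function takes named arguments instead of a row vector, a longer condition can be written as ordinary if/else branches rather than nested ifelse() calls.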

Upvotes: 2
