Mike Williamson

Reputation: 3178

R apply function across rows, unexpected answer

I don't understand what is going on here:

Set up:

> df = data.frame(x1= rnorm(10), x2= rnorm(10))
> df[3,1] <- "the"
> df[6,2] <- "NA"
## I want to create values that will be challenging to coerce to numeric
> df$x1.fixed <- as.numeric(df$x1)
> df$x2.fixed <- as.numeric(df$x2)
## Here is the DF
> df
                   x1                 x2   x1.fixed   x2.fixed
1   0.955965351551298 -0.320454533088042  0.9559654 -0.3204545
2   -1.87960909714257   1.61618672247496 -1.8796091  1.6161867
3                 the -0.855930398468875         NA -0.8559304
4  -0.400879592905882 -0.698655375066432 -0.4008796 -0.6986554
5   0.901252404134257  -1.08020133150191  0.9012524 -1.0802013
6    0.97786920899034                 NA  0.9778692         NA
.
.
.
> table(is.na(df[,c(3,4)]))

FALSE  TRUE 
   18     2 

I wanted to find the rows that got converted to NAs, so I put in a complex apply that did not work as expected. I then simplified and tried again...

Question:

Simpler call:

> apply(df, 1, function(x) (any(is.na(df[x,3]), is.na(df[x,4]))))

which unexpectedly yielded:

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Instead, I'd expected:

[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE

to highlight the rows (3 & 6) where an NA existed. To verify that non-apply'ed functions would work, I tried:

> any(is.na(df[3,1]), is.na(df[3,2]))
[1] FALSE
> any(is.na(df[3,3]), is.na(df[3,4]))
[1] TRUE

as expected. To further my confusion on what apply is doing, I tried:

> apply(df, 1, function(x) is.na(df[x,1]))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE

Why is this traversing the entire DF, when I have clearly indicated both (a) that I want it in the row direction (I passed "1" into the second parameter), and (b) the value "x" is only placed in the row id, not the column id?

I understand there are other, and perhaps better, ways to do what I am trying to do (find the rows that have been changed to NAs in the new columns). But please don't supply those in the answer. Instead, please explain why apply did not work as I'd expected, and what I could do to fix it.

Upvotes: 0

Views: 295

Answers (3)

talat

Reputation: 70256

You could use

rowSums(is.na(df))>0
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

to find the rows containing NAs.

This is a vectorized operation (is.na(df) builds a logical matrix and rowSums counts the TRUE values in each row), so it will likely be faster than apply if you are working with large data.
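To see why this works, here is a small sketch on a made-up data frame (`toy` and the variable names are mine, not the OP's data): `is.na()` on a data frame returns a logical matrix, and `rowSums()` counts the TRUE entries per row.

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

na_mat     <- is.na(toy)       # logical matrix with the same shape as toy
na_per_row <- rowSums(na_mat)  # NA count per row: 1, 2, 0
has_na     <- na_per_row > 0   # TRUE for rows 1 and 2, FALSE for row 3
```

Comparing the count against 0 is what turns "how many NAs" into "any NA at all".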

Upvotes: 0

eddi

Reputation: 49448

To find the columns that have NA's you can do:

sapply(df, function(x) any(is.na(x)))
#      x1       x2 x1.fixed x2.fixed 
#   FALSE    FALSE     TRUE     TRUE 

A data.frame is a list of vectors, so the above function inside sapply will evaluate any(is.na(x)) for each element of that list, i.e. each column.
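The "list of vectors" point can be checked directly. A minimal sketch, using a made-up data frame (`toy` is a hypothetical name, not the OP's object):

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(a = c(1, NA), b = c("x", "y"))

is_a_list  <- is.list(toy)   # TRUE: a data frame is a list underneath
n_elements <- length(toy)    # 2: one list element per column

# sapply() therefore hands each column, as a whole vector, to the function
col_has_na <- sapply(toy, function(x) any(is.na(x)))  # a: TRUE, b: FALSE
```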

As per OP edit - to get the rows that have NA's, use apply(df, 1, ... instead:

apply(df, 1, function(x) any(is.na(x)))
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Upvotes: 2

joran

Reputation: 173547

apply is working exactly as it is supposed to. It is your expectations that are wrong.

apply(df, 1, function(x) is.na(df[x,1]))

The first thing that apply does (per the documentation) is coerce your data frame to a matrix. A matrix can hold only one type, and x1 and x2 are character columns, so in the process all numeric columns are coerced to character as well.

Next, each individual row of df is passed as the argument x to your function. In what sense is it meaningful to index df by the character values in the first row in df? So you just get a bunch of NAs. You can test this via:

> df[as.character(df[1,]),]
       x1   x2 x1.fixed x2.fixed
NA   <NA> <NA>       NA       NA
NA.1 <NA> <NA>       NA       NA
NA.2 <NA> <NA>       NA       NA
NA.3 <NA> <NA>       NA       NA
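Both steps can be checked by hand. The sketch below rebuilds a data frame like the one in the question (the seed is an arbitrary choice of mine, so the numbers will differ from the OP's) and uses as.matrix as a stand-in for the coercion that apply performs internally:

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
df[3, 1] <- "the"   # turns x1 into a character column
df[6, 2] <- "NA"    # turns x2 into a character column
df$x1.fixed <- suppressWarnings(as.numeric(df$x1))
df$x2.fixed <- suppressWarnings(as.numeric(df$x2))

m <- as.matrix(df)         # the coercion apply() does before looping
matrix_type <- typeof(m)   # "character": the numeric columns did not survive

x <- m[1, ]                # what the function receives for row 1
looked_up <- df[x, 1]      # row lookup by four strings that match no row name
all_missing <- all(is.na(looked_up))  # TRUE: four NAs, hence four TRUEs
```

That is where the 4 x 10 block of TRUE in the question comes from: four lookups per row, ten rows, and every lookup misses.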

You say you want to know which columns introduced NAs, and yet you are applying over rows. If you really wanted to use apply (I recommend @eddi's method) you could do:

apply(df,2,function(x) any(is.na(x)))
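If rows are what you are after, as in the question, here is a sketch of two ways to repair the original call. The seed and the names `by_index` and `by_apply` are my own choices; the first variant keeps your df[i, 3] style but loops over row numbers instead of row contents, while the second uses x as the row itself.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
df[3, 1] <- "the"
df[6, 2] <- "NA"
df$x1.fixed <- suppressWarnings(as.numeric(df$x1))
df$x2.fixed <- suppressWarnings(as.numeric(df$x2))

# Variant 1: iterate over row indices, so df[i, 3] means what you intended
by_index <- sapply(seq_len(nrow(df)),
                   function(i) any(is.na(df[i, 3]), is.na(df[i, 4])))

# Variant 2: keep apply(), but treat x as the row and restrict to the new columns
by_apply <- apply(df[, 3:4], 1, function(x) any(is.na(x)))

rows_with_na <- which(by_index)  # rows 3 and 6
```

Restricting to columns 3 and 4 in the second variant also keeps the matrix numeric, so no character coercion happens there.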

Upvotes: 1
