Reputation: 3178
I don't understand what is going on here:
> df = data.frame(x1= rnorm(10), x2= rnorm(10))
> df[3,1] <- "the"
> df[6,2] <- "NA"
## I want to create values that will be challenging to coerce to numeric
> df$x1.fixed <- as.numeric(df$x1)
> df$x2.fixed <- as.numeric(df$x2)
## Here is the DF
> df
x1 x2 x1.fixed x2.fixed
1 0.955965351551298 -0.320454533088042 0.9559654 -0.3204545
2 -1.87960909714257 1.61618672247496 -1.8796091 1.6161867
3 the -0.855930398468875 NA -0.8559304
4 -0.400879592905882 -0.698655375066432 -0.4008796 -0.6986554
5 0.901252404134257 -1.08020133150191 0.9012524 -1.0802013
6 0.97786920899034 NA 0.9778692 NA
.
.
.
> table(is.na(df[,c(3,4)]))
FALSE TRUE
18 2
I wanted to find the rows that got converted to NAs, so I put in a complex apply that did not work as expected. I then simplified and tried again...
Simpler call:
> apply(df, 1, function(x) (any(is.na(df[x,3]), is.na(df[x,4]))))
which unexpectedly yielded:
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Instead, I'd expected:
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
to highlight the rows (3 & 6) where an NA existed. To verify that non-apply'ed functions would work, I tried:
> any(is.na(df[3,1]), is.na(df[3,2]))
[1] FALSE
> any(is.na(df[3,3]), is.na(df[3,4]))
[1] TRUE
as expected. To further my confusion on what apply is doing, I tried:
> apply(df, 1, function(x) is.na(df[x,1]))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Why is this traversing the entire DF, when I have clearly indicated both (a) that I want it in the row direction (I passed "1" into the second parameter), and (b) the value "x" is only placed in the row id, not the column id?
I understand there are other, and perhaps better, ways to do what I am trying to do (find the rows that have been changed to NAs in the new columns). But please don't supply that in the answer. Instead, please explain why apply did not work as I'd expected, and what I could do to fix it.
Upvotes: 0
Views: 295
Reputation: 70256
You could use
rowSums(is.na(df))>0
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
to find the rows containing NAs.
I'm not sure, but I think this is a vectorized operation which might be faster than using apply in case you are working with large data.
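As a small sketch of why this works (the data frame below is made up for illustration and is not the one from the question): is.na on a data frame returns a logical matrix, and rowSums counts the TRUE values in each row.

```r
# Hypothetical toy data, not the data frame from the question.
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

na_mask <- is.na(df)            # logical matrix with the same shape as df
na_per_row <- rowSums(na_mask)  # TRUE counts as 1, so this is the NA count per row
has_na <- na_per_row > 0        # TRUE for rows with at least one NA

print(na_per_row)               # 1 2 0
print(has_na)                   # TRUE TRUE FALSE
```

To look only at the converted columns from the question, the same idea should apply to a column subset such as df[, c("x1.fixed", "x2.fixed")].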
Upvotes: 0
Reputation: 49448
To find the columns that have NA's you can do:
sapply(df, function(x) any(is.na(x)))
# x1 x2 x1.fixed x2.fixed
# FALSE FALSE TRUE TRUE
A data.frame is a list of vectors, so the above function inside sapply will evaluate any(is.na( for each element of that list, i.e. each column.
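A quick way to see the "list of vectors" point for yourself, using a made-up two-column data frame (the names a, b and col_has_na are just for illustration):

```r
# Hypothetical toy data to show that sapply() walks over columns.
df <- data.frame(a = c(1, NA), b = c("x", "y"))

is.list(df)   # TRUE: a data frame is a list under the hood
length(df)    # 2: one list element per column

# Each call receives one whole column as `col`.
col_has_na <- sapply(df, function(col) any(is.na(col)))
print(col_has_na)   # a is TRUE, b is FALSE
```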
As per OP edit - to get the rows that have NA's, use apply(df, 1, ... instead:
apply(df, 1, function(x) any(is.na(x)))
# [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Upvotes: 2
Reputation: 173547
apply is working exactly as it is supposed to. It is your expectations that are wrong.
apply(df, 1, function(x) is.na(df[x,1]))
The first thing that apply does (per the documentation) is coerce your data frame to a matrix. A matrix can hold only one type, and because x1 and x2 now contain character values, the numeric columns are coerced to character as well.
Next, each individual row of df is passed as the argument x to your function, so x holds that row's values, not its row number. In what sense is it meaningful to index df by the character values in the first row in df? None of those strings matches a row name, so you just get a bunch of NAs. You can test this via:
> df[as.character(df[1,]),]
x1 x2 x1.fixed x2.fixed
NA <NA> <NA> NA NA
NA.1 <NA> <NA> NA NA
NA.2 <NA> <NA> NA NA
NA.3 <NA> <NA> NA NA
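To make the coercion visible, here is a minimal sketch on a reduced version of the question's data (the numbers are illustrative, and suppressWarnings is only there to silence the expected coercion warning):

```r
# Reduced, illustrative version of the question's data frame.
df <- data.frame(x1 = c(0.5, -1.2, 0.3), x2 = c(1.1, 0.7, -0.4))
df[3, 1] <- "the"                                   # x1 becomes a character column
df$x1.fixed <- suppressWarnings(as.numeric(df$x1))  # "the" becomes NA

m <- as.matrix(df)          # roughly the first step apply() performs
print(typeof(m))            # "character", even though x2 and x1.fixed were numeric

# Inside apply(), x is one row of that character matrix, not a row number.
row_classes <- apply(df, 1, function(x) class(x))
print(row_classes)          # "character" for every row
```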
You say you want to know which columns introduced NAs, and yet you are applying over rows. If you really wanted to use apply (I recommend @eddi's method) you could do:
apply(df,2,function(x) any(is.na(x)))
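If the goal is instead to keep the row-wise df[i, 3] style from the question, one possible fix (a sketch, not the only way) is to iterate over row numbers rather than over the rows themselves, or to let apply use the values it is handed. The names by_index and by_row, and the set.seed call, are additions for this example.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
df[3, 1] <- "the"
df[6, 2] <- "NA"
df$x1.fixed <- suppressWarnings(as.numeric(df$x1))
df$x2.fixed <- suppressWarnings(as.numeric(df$x2))

# Option 1: loop over row numbers, so df[i, 3] really indexes row i.
by_index <- sapply(seq_len(nrow(df)),
                   function(i) any(is.na(df[i, 3]), is.na(df[i, 4])))

# Option 2: keep apply(), but test the row values passed in as x
# instead of using x to index back into df.
by_row <- apply(df[, c("x1.fixed", "x2.fixed")], 1,
                function(x) any(is.na(x)))

print(which(by_index))   # 3 6
```

Both options flag rows 3 and 6 here, which is the output the question expected.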
Upvotes: 1