Reputation: 5839

is.na() behaves differently than is.numeric() - where's the consistency?

Let's create the data frame:

df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))

  VarA VarB
1    1   NA
2   NA    2
3    5    7

If I run a simple NA query it shows me the locations of each NA.

is.na(df)

      VarA  VarB
[1,] FALSE  TRUE
[2,]  TRUE FALSE
[3,] FALSE FALSE

Why doesn't is.numeric return the same type of data frame? It only outputs a single "FALSE".

is.numeric(df)

[1] FALSE

Is there a good explanation of data types, classes, etc. somewhere? I read about these things often but don't have a solid feel for them. I don't get the difference between a matrix and data frame, or num vs dbl. It's easy to conflate these things.

I did the Cyclismo "basic data types" tutorial but would like to dig a little deeper.

Upvotes: 4

Answers (3)

Gregor Thomas

Reputation: 146249

First - documentation

Let's turn to the documentation. From ?is.na:

The generic function is.na indicates which elements are missing.

So is.na is made to tell you which individual elements within an object are missing.

From ?is.numeric:

is.numeric is a more general test of an object being interpretable as numbers.

So is.numeric tells you whether an object is numeric (not whether individual elements within the object are numeric).

These are behaving exactly as documented - is.na(df) tells you which elements of the data frame are missing. is.numeric(df) tells you that df is not numeric (in fact, it is a data.frame).

Is it inconsistent?

I can see how this seems inconsistent. There are just a few is.* functions that work element-wise. is.na, is.finite, is.nan are the only ones I can think of. All the other is.* functions work on the whole object. The element-wise functions are essentially stand-ins for equality testing with == when the equality testing wouldn't work (more on this below). But once you understand the data structures a little more, they don't seem inconsistent, because they really wouldn't make sense the other way.
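As a small illustration of that split, here is a sketch on a made-up vector x (not from the question). The element-wise tests return one logical per element, while the whole-object tests return a single logical. is.infinite, which isn't listed above, behaves element-wise as well.

```r
# Made-up vector containing a regular number, NA, Inf and NaN
x <- c(1, NA, Inf, NaN)

# Element-wise: one answer per element
na_flags     <- is.na(x)      # FALSE  TRUE FALSE  TRUE  (NaN counts as missing)
finite_flags <- is.finite(x)  #  TRUE FALSE FALSE FALSE
nan_flags    <- is.nan(x)     # FALSE FALSE FALSE  TRUE

# Whole-object: one answer for the entire vector
whole_numeric   <- is.numeric(x)    # TRUE
whole_character <- is.character(x)  # FALSE
```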

`is.numeric` makes sense the way it is

It would not make sense for is.numeric to be applied element-wise. A vector is either numeric or not in its entirety - whether or not it has missing values. If you wanted to apply the is.numeric function to each column of your data frame, you could do

sapply(df, is.numeric)

Which will tell you that both columns are numeric. You could make an argument that the default behavior when is.numeric() is given a data frame should be to apply it to every column, but it's possible someone wants to make sure that something is a numeric vector, not a data.frame (or anything else), and having, say, a one-column data.frame say TRUE to is.numeric() could cause confusion and errors.
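To see the per-column check do something useful, here is a sketch on a hypothetical data frame, mixed, with one non-numeric column added for illustration. It uses vapply() rather than sapply() only because vapply() guarantees a logical result; either works here.

```r
# Hypothetical data frame: VarC is an illustrative non-numeric column
mixed <- data.frame(VarA = c(1, NA, 5), VarC = c("x", "y", "z"))

# One logical per column; vapply() enforces that each result is a single logical
col_is_num <- vapply(mixed, is.numeric, logical(1))
col_is_num

# Use the result to keep only the numeric columns
numeric_only <- mixed[, col_is_num, drop = FALSE]
names(numeric_only)
```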

`is.na` makes sense the way it is

Conversely, it wouldn't make sense for is.na to not be applied element-wise. NA is a stand-in for a single value, not a complicated object like a data.frame. It wouldn't really make sense to have a "missing" data frame - you could have a missing value, but there's nothing to tell you that it's a data frame. However, a data.frame (or a vector, or a matrix...) can contain missing values, and is.na will tell you exactly where they are.

This is pretty much identical to how equality (or other comparisons) works. You could also check for 1s in your data frame with df == 1, or for values less than 5 with df < 5. is.na() is the recommended way to check for missing values - anything == NA returns NA, so df == NA doesn't work for that. is.na(df) is the right way to do this.
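A short sketch of that comparison, using the df from the question. The names eq_na and na_cells are just illustrative.

```r
df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))

# Comparing against NA gives NA in every cell, so it cannot locate missing values
eq_na <- df == NA
all(is.na(eq_na))

# Ordinary comparisons are element-wise, and NA cells stay NA
df == 1
df < 5

# is.na() gives a definite TRUE/FALSE for every cell
na_cells <- is.na(df)
sum(na_cells)                    # total number of missing cells
which(na_cells, arr.ind = TRUE)  # row/column position of each missing cell
```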

To accomplish this, is.na actually has many methods. You can see them with methods("is.na"). In my current R session, I see

methods("is.na")
 [1] is.na,abIndex-method       is.na,denseMatrix-method   is.na,indMatrix-method    
 [4] is.na,nsparseMatrix-method is.na,nsparseVector-method is.na,sparseMatrix-method 
 [7] is.na,sparseVector-method  is.na.coxph.penalty*       is.na.data.frame          
[10] is.na.data.table*          is.na.integer64*           is.na.numeric_version     
[13] is.na.POSIXlt              is.na.raster*              is.na.ratetable*          
[16] is.na.Surv*

This shows me that all these different types of objects support an is.na() call to nicely tell me where missing values are inside of them. And if I call it on another object class, R falls back to the built-in default behavior of is.na (it is a primitive, so there is no separate is.na.default function to look at).
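To make the method idea concrete, here is a minimal sketch of an S3 method for a hypothetical class called "reading". The class, its constructor new_reading, and its fields are invented for illustration and don't come from any package.

```r
# Hypothetical class: a set of measurements plus a unit label
new_reading <- function(value, unit) {
  structure(list(value = value, unit = unit), class = "reading")
}

# S3 method: is.na() dispatches here for objects of class "reading"
is.na.reading <- function(x) {
  is.na(x$value)
}

r <- new_reading(c(20.5, NA, 19), "degC")
reading_na <- is.na(r)
reading_na
```

Once the method exists, is.na(r) reports which measurements are missing instead of inspecting the underlying list, which is the same job is.na.data.frame does for data frames.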

Secondary questions

I don't get the difference between a matrix and data frame, or num vs dbl. It's easy to conflate these things.

num vs dbl is not relevant to R. I'm shocked that anything directed at R beginners would mention doubles - it shouldn't. If you look at the help at ?double, it includes:

It is identical to numeric.

... as.double is a generic function. It is identical to as.numeric.

For R purposes, forget the term double and just use numeric.
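If you do want to see where the two words come from, here is a small sketch with a made-up vector x. class() reports "numeric" and typeof() reports "double" for the very same object. The only wrinkle I'm aware of is that integers also count as numeric.

```r
x <- c(1, NA, 5)

x_class <- class(x)    # "numeric"
x_type  <- typeof(x)   # "double"

# as.double() and as.numeric() give the same result
same_conversion <- identical(as.double(x), as.numeric(x))

# Integers are numeric too, but they are not stored as doubles
int_is_numeric <- is.numeric(1L)  # TRUE
int_is_double  <- is.double(1L)   # FALSE
```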

I don't get the difference between a matrix and data frame

Both are rectangular - rows and columns. A matrix can only have one data type/class inside it - the whole matrix is numeric, or character, or integer, etc., with no mixing. A data.frame can have a different class for each of its columns: the first column can be numeric, the second character, the third factor, etc.

Matrices are simpler and more efficient, very suitable for linear algebra operations. Data frames are much more common because it is common to have data of mixed types.
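A quick way to see the "one type only" rule is to convert a mixed data frame to a matrix. This is a sketch on a hypothetical data frame, mixed; stringsAsFactors = FALSE is set explicitly so it behaves the same on older R versions.

```r
# Hypothetical data frame with one numeric and one character column
mixed <- data.frame(id = c(1, 2, 3), label = c("a", "b", "c"),
                    stringsAsFactors = FALSE)

# Each column keeps its own class inside the data frame
col_classes <- sapply(mixed, class)
col_classes

# A matrix holds a single type, so everything is coerced to character
m <- as.matrix(mixed)
typeof(m)   # "character"
```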

Upvotes: 5

Carl Boneri

Reputation: 2722

str(df)
'data.frame':   3 obs. of  2 variables:
$ VarA: num  1 NA 5
$ VarB: num  NA 2 7

The thing to consider is this: is.na is testing each value that appears in a vector, whereas is.numeric is checking the class of the object itself. It's apples-to-oranges in a sense. Think of it like this:

Is this object Not Available (NA)? Since the object itself exists, check each value contained in the tested vectors. Is this object a number? Nope, it's a data.frame.

Upvotes: 0

Ben Bolker

Reputation: 227081

Primarily because the test in is.numeric() applies to the whole object (so returns a single value that says whether the entire object is numeric), while is.na() applies to individual elements of the object.

The next, subtler question (which you haven't asked yet but might ask next) is: why doesn't is.numeric() return TRUE, since all the elements of the data frame are numeric? It's because data frames are internally represented as lists, and could contain elements of different types (is.numeric(as.matrix(df)) does return TRUE).
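A short sketch of those checks on the df from the question (the result names are just illustrative):

```r
df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))

# A data frame is a list of columns under the hood
df_is_list <- is.list(df)   # TRUE
df_type    <- typeof(df)    # "list"

# So the object as a whole is not numeric ...
df_is_num <- is.numeric(df)              # FALSE

# ... but a single column, or the all-numeric matrix version, is
col_is_num <- is.numeric(df$VarA)        # TRUE
mat_is_num <- is.numeric(as.matrix(df))  # TRUE
```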

Upvotes: 4
