Jibril

Reputation: 1037

Understanding R - is.na and blank "" cells

I have a dataset. Before a lot of different file manipulations, many cells were "NA".

After the manipulations, for whatever reason, they have all become purely empty. So, to be clear, the SAME cells that were previously NA in Excel now show up totally blank. No big deal, right?

Well, when I read the data into R I get...

 [1] ""                  ""                  "6.4019975396e+17" 
 [4] ""                  ""                  ""                 
 [7] ""                  ""                  "6.40275087015e+17"
[10] "6.4062774821e+17"  ""                  "6.40602341e+17"   
[13] ""                  ""                  "6.40360673735e+17"
[16] "6.40326194081e+17" "6.40326465381e+17" "6.40322363352e+17"

Still seems fine to me, except when I run

is.na(data_frame$column_name)

I get ALL FALSE. Every single one. Am I misunderstanding how is.na works?

EDIT - This was kind of vague. Of course I am misunderstanding how it works. Can you explain why an empty cell does not count as an NA cell? Is there a quick-fix that can be applied to a data frame to make anything that is "" or what would be a blank cell in a CSV to NA for R's sake?

Upvotes: 6

Views: 42777

Answers (2)

Ankit Katiyar

Reputation: 3001

I believe that not only R but programming languages in general treat an empty "" and NA (null, in some languages) differently.

NA is the value used where nothing was provided or no value is assigned. An empty "" is a string value: it means there is a string, and it happens to be empty.

I just found an interesting article about looking at a dataset. It shows how to look at summaries of all the columns of a dataset in one go: http://www.bytefold.com/generate-metadata-for-a-dataset-in-r/

Upvotes: 0

Gregor Thomas

Reputation: 145765

Can you explain why an empty cell does not count as an NA cell?

I think, in short, the answer is that to R, NA and empty "" are different. The why of it is that "" is a known value (a string with zero characters), while NA is something that is truly missing: you have no idea what it is, it could be anything.
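A quick way to see the distinction is to test a small vector that contains both kinds of value. This is only a sketch; the vector `x` is made up for illustration and is not the asker's data.

```r
# A character vector with an empty string, a real NA, and a value
x <- c("", NA, "6.4019975396e+17")

before <- is.na(x)   # only the real NA counts as missing
blank  <- x == ""    # comparing NA with "" gives NA, not FALSE

print(before)        # FALSE  TRUE FALSE
print(blank)         # TRUE    NA FALSE

# which() drops the NA from the comparison, so only true blanks are replaced
x[which(blank)] <- NA
after <- is.na(x)
print(after)         # TRUE  TRUE FALSE
```

So `is.na()` returning all FALSE on a column of blanks is expected: every cell holds a value, it is just a string with nothing in it.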

To replace blanks with NA, post-hoc, for a single column you could do

data$column[data$column == ""] <- NA

To do that for all columns in a data frame

# Assigning into data[] keeps the data frame; the function must return x
data[] <- lapply(data, function(x) {x[x == ""] <- NA; x})
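To check the all-columns version on something small, here is a sketch with a made-up data frame `df` (the column names `id` and `value` are hypothetical). It also shows an alternative that indexes the data frame with a logical matrix, assuming all the columns can be compared with "".

```r
# Toy data frame with blanks in two character columns
df <- data.frame(
  id    = c("a", "", "c"),
  value = c("", "6.40275087015e+17", ""),
  stringsAsFactors = FALSE
)

# Column-by-column: the anonymous function returns the modified column,
# and df[] <- ... keeps df a data frame instead of turning it into a list
df1 <- df
df1[] <- lapply(df1, function(x) { x[x == ""] <- NA; x })

# Alternative: df == "" is a logical matrix that can index the data frame
df2 <- df
df2[df2 == ""] <- NA

print(is.data.frame(df1))    # TRUE
print(colSums(is.na(df1)))   # id: 1, value: 2
print(colSums(is.na(df2)))   # same counts
```

Either form works after the fact; which one reads better is mostly a matter of taste.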

As pointed out in comments, the best time to address the problem is when you read the data in, with the na.strings argument of read.csv or read.table.

read.csv(file_name, na.strings = c("", "NA"))
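As a rough illustration of what `na.strings` changes, the sketch below reads the same CSV text twice. The CSV content and the column names `id` and `value` are invented for the example, and `colClasses` is set so that `value` is read as character, which is an assumption based on the quoted strings in the question's output.

```r
# Inline CSV: row 1 has a blank cell, row 3 has a literal NA
csv_text <- "id,value\n1,\n2,6.4019975396e+17\n3,NA\n"

# Default na.strings = "NA": the blank stays an empty string
d_default <- read.csv(text = csv_text,
                      colClasses = c("integer", "character"))

# Treat both "" and "NA" as missing while reading
d_fixed <- read.csv(text = csv_text,
                    na.strings = c("", "NA"),
                    colClasses = c("integer", "character"))

print(is.na(d_default$value))  # FALSE FALSE  TRUE
print(is.na(d_fixed$value))    # TRUE FALSE  TRUE
```

With a real file, `text = csv_text` would be replaced by the file name, as in the call above.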

Upvotes: 10
