Reputation: 37
I have 2 problems:
Problem 1: I am trying to work out how to identify any common missing value formats like NA, " ", "".
I thought is.na would identify all of these formats. Can someone point me in the right direction for what I need to do here?
Problem 2: I need to count the NA, " " and "" values and list the position for all of them.
I've tried:
```{r, echo=TRUE,include=TRUE}
sum(is.na(DF))
which(is.na(DF))
```
but it only counts the NA values (16) and tells me which value position they are in.
However, I also happen to know that my dataset has 10 missing values whose format isn't NA but " ", so the total for missing values should be 26, and I should get the value position for all of them.
I tried using something like:
```{r, echo=TRUE,include=TRUE}
sum(is.na(DF, na.strings=c("NA"," ","")))
```
But I got this error: Error in is.na(DF, na.strings = c("NA", " ", "")) : 2 arguments passed to 'is.na' which requires 1
Any ideas on what to do here would be amazing as well.
Thank you!
Upvotes: 1
Views: 599
Reputation: 51974
`is.na` only detects NA values, not `" "` nor `""`. You can convert `" "` and `""` to NA using `gsub`, and then use `is.na`:
```r
v = c(NA, "", " ", "A")
gsub("^$|^ $", NA, v)
# [1] NA NA NA "A"
sum(is.na(gsub("^$|^ $", NA, v)))
# [1] 3
which(is.na(gsub("^$|^ $", NA, v)))
# [1] 1 2 3
```
Explanation: `^$` matches the empty string (`^` marks the beginning of the string and `$` the end). `^ $` matches a string consisting of exactly one space (the anchors serve the same purpose), and `|` is the OR operator.
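The same idea can be applied to a whole data frame rather than a single vector. The following is only a minimal sketch: the question does not show `DF`, so the data frame here is hypothetical, and `is_missing` is a helper name made up for illustration. It flags a cell as missing when it is NA, `""` or `" "`, then counts and locates the flagged cells.

```r
# Hypothetical stand-in for the DF in the question (the real data is not shown).
DF <- data.frame(
  id   = c(1, 2, NA, 4),
  name = c("A", "", " ", NA),
  city = c(" ", "B", "C", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is NA, "" or a single space.
is_missing <- function(x) is.na(x) | as.character(x) %in% c("", " ")

# Apply the helper to every column; the result is a logical matrix
# with one row per row of DF and one column per column of DF.
miss <- sapply(DF, is_missing)

sum(miss)                           # total number of missing cells
# [1] 6
which(miss)                         # positions, counted column by column
# [1]  3  6  7  8  9 12
pos <- which(miss, arr.ind = TRUE)  # the same positions as row/column pairs
colSums(miss)                       # number of missing cells per column
```

As for the error in the question, `na.strings` is an argument of `read.table()` and `read.csv()`, not of `is.na()`, so another option may be to declare those strings as missing when the file is read, if the data comes from a file.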
Upvotes: 4