stites

Reputation: 5143

grep behavior is odd for NA or "" entries

I am fairly new to R and am working with a vector that has empty entries, and I noticed that grep acts counter-intuitively with my data. I'll work from an example, as I am not 100% sure how to explain it. Say I have two vectors:

A<-c("","","","","","","a")
B<-c(NA,NA,NA,NA,NA,NA,"a")

A is how the data was stored originally, and B is how R is reading my data. To my understanding, running vec[grep("", vec, invert=TRUE)] searches vec for all empty cells, returns their indices, and then (because of invert=TRUE) displays a result vector containing only the non-empty entries. However, when I run this for vec=A and vec=B I get:

vec = A:

> A[grep("",A, invert=FALSE)]
[1] ""  ""  ""  ""  ""  ""  "a"
> A[grep("",A, invert=TRUE)]
character(0)

vec = B:

> B[grep("",B, invert=FALSE)]
[1] "a"
> B[grep("",B, invert=TRUE)]
[1] NA NA NA NA NA NA

Since I thought my data was being read as in case B, I was stumped by the counter-intuitive result. I realize this could simply be a variable-type issue, but I was wondering if someone could shed some more light on what is going on.

Quick edit: Case A makes sense to me: since grep can't find "" because the variable types are off, it returns everything. Inverted, it returns character(0) as the default for "nothing". I am still confused by case B.

Upvotes: 5

Views: 8129

Answers (2)

mathematical.coffee

Reputation: 56915

Note that grep performs regular expression searches (not string matching).

The regex "" that you have fed in is empty, so running grep asks whether each of the strings it is matching against contains "", not whether the string entirely matches "". Every string contains the empty string, so the pattern matches every non-NA element.

For example,

grepl("a","bananas")

returns TRUE because "a" is in "bananas".
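The same "contains" logic applies to the empty pattern. A small sketch, where the vector x is made up purely for illustration:

```r
# x is a made-up vector for illustration, not from the question
x <- c("", "a", "bananas")

# Every string contains the empty string, so "" matches each element
empty_hits <- grepl("", x)
print(empty_hits)  # TRUE TRUE TRUE

# A one-character pattern only matches elements containing that character
a_hits <- grepl("a", x)
print(a_hits)      # FALSE TRUE TRUE
```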

If you want to match the entire string against "", you can use '^' and '$' in your regex ('^' means start of string, '$' means end of string):

grepl("^$", "") # returns TRUE
grepl("^$", "a") # returns FALSE

However, you're probably better off not using regex at all if it's just the empty cells you want to drop:

A[A != ""] # returns "a"
B[!is.na(B)] # returns "a"
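If a vector could hold both kinds of "empty" at once, the two tests can be combined. This is only a sketch; the vector mixed is a hypothetical mix of the question's two cases:

```r
# mixed is a hypothetical vector containing both "" and NA entries
mixed <- c("", NA, "", NA, "a")

# Check is.na() first: mixed != "" yields NA for missing entries,
# but FALSE & NA evaluates to FALSE, so those entries are dropped too
keep <- !is.na(mixed) & mixed != ""
non_empty <- mixed[keep]
print(non_empty)  # "a"
```

I'd avoid nzchar() here unless you pass keepNA = TRUE, since by default it reports TRUE for NA.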

Upvotes: 10

IRTFM

Reputation: 263362

For your first question:

> A[grep("^$", A)]
[1] "" "" "" "" "" ""
> A[grep("^$", A, invert=TRUE)]
[1] "a"

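Your use of "" as a pattern is picking up every non-missing character element, because the empty pattern matches anywhere in a string. The use of "^$" is picking up only the locations of character elements where there are no characters between the beginning and the end.

Just as NA does not "==" anything (even itself), so too does NA not match "". grep treats NA elements as non-matches, which is why only "a" is returned in case B and invert=TRUE returns the six NA entries.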
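A quick way to see the NA behaviour for yourself, using the B vector from the question (the variable names na_eq, hits and misses are just for illustration):

```r
B <- c(NA, NA, NA, NA, NA, NA, "a")

# Comparing NA with anything, even itself, gives NA rather than TRUE
na_eq <- NA == NA
print(na_eq)   # NA

# grep never reports an NA element as a match, so only position 7 is returned
hits <- grep("", B)
print(hits)    # 7

# invert = TRUE returns everything that did not match, i.e. the NA positions
misses <- grep("", B, invert = TRUE)
print(misses)  # 1 2 3 4 5 6
```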

Upvotes: 4
