Reputation: 5143
I am fairly new to R and am working with a vector that has empty entries, and I noticed that grep behaves counter-intuitively with my data. I'm just going to work with an example, as I am not 100% sure how to explain it. Say I have two vectors:
A<-c("","","","","","","a")
B<-c(NA,NA,NA,NA,NA,NA,"a")
A is how the data was stored originally, and B is how R is reading my data. Running > vec[grep("",vec, invert=TRUE)] (to my understanding) searches vec for all empty cells, returns their indices, then populates and displays a result vector with the non-empty data entries. However, when I run this for vec=A and vec=B I get:
vec = A:
> A[grep("",A, invert=FALSE)]
[1] "" "" "" "" "" "" "a"
> A[grep("",A, invert=TRUE)]
character(0)
vec = B:
> B[grep("",B, invert=FALSE)]
[1] "a"
> B[grep("",B, invert=TRUE)]
[1] NA NA NA NA NA NA
Since I thought my data was being read like case B I was stumped by the counter-intuitive result. I realize this could simply be a variable-type issue however I was wondering if someone could shed some more light on the situation as to what is going on.
Quick edit: Case A makes sense: since grep can't find "" because the variable types are off, it returns everything. Inverted, it returns character(0) as the default for "nothing". Still confused by case B.
Upvotes: 5
Views: 8129
Reputation: 56915
Note that grep performs regular expression searches (not exact string matching).
The regex "" that you have fed in is empty, so grep asks whether each string it is matching against contains "", not whether the string matches "" in its entirety. Every non-NA string contains the empty string, so every such element matches.
For example, grepl("a","bananas") returns TRUE because "a" is in "bananas".
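To make the "contains" point concrete, here is a minimal sketch (the vector x is made up for illustration) showing that an empty pattern is found in every non-NA string, including the empty string itself:

```r
# Hypothetical example vector; not from the original question.
x <- c("", "a", "bananas")

# The empty pattern is "contained" in every non-NA string.
empty_hits <- grepl("", x)
print(empty_hits)  # TRUE TRUE TRUE
```

This is why A[grep("", A)] in the question hands back the whole vector.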
If you want to match the entire string against "", you can anchor the regex with '^' and '$' ('^' means start of string, '$' means end of string):
grepl("^$", "") # returns TRUE
grepl("^$", "a") # returns FALSE
However, you're probably better off not using regex at all if all you want is to drop the empty cells:
A[A != ""] # returns "a"
B[!is.na(B)] # returns "a"
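If a vector might contain both "" and NA at once, the two filters can be combined. This is only a sketch: drop_blank and mixed are names I made up for illustration, not anything built into R.

```r
# Hypothetical helper: keep only elements that are neither NA nor "".
drop_blank <- function(x) x[!is.na(x) & x != ""]

# Made-up vector mixing empty strings and NAs.
mixed <- c("", NA, "a", "", NA)

kept <- drop_blank(mixed)
print(kept)  # "a"
```

The is.na() check matters because NA != "" evaluates to NA, and indexing with NA yields NA elements rather than dropping them.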
Upvotes: 10
Reputation: 263362
For your first question:
> A[grep("^$", A)]
[1] "" "" "" "" "" ""
> A[grep("^$", A, invert=TRUE)]
[1] "a"
Your use of "" as a pattern is picking up every non-NA character element. The use of "^$" picks up the locations of character elements where there are no characters between the beginning and end.
Just as NA does not "==" anything (even itself), so too does NA not match "".
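The NA behaviour can be checked directly. A small sketch, using a made-up two-element vector, that mirrors case B from the question:

```r
# NA compared with NA is NA, not TRUE.
na_equal <- NA == NA
print(na_equal)  # NA

# grepl() reports FALSE for an NA input, even with the empty pattern.
na_match <- grepl("", NA)
print(na_match)  # FALSE

# So grep() skips the NA, and invert = TRUE returns its index instead.
hits   <- grep("", c(NA, "a"))                 # 2
misses <- grep("", c(NA, "a"), invert = TRUE)  # 1
print(hits)
print(misses)
```

That is why B[grep("", B)] gives only "a", while the inverted call gives back the NAs.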
Upvotes: 4