Reputation: 4109
Given a character vector:
id.data = c("XXX-2355",
"XYz-03",
"XYU-3",
"ABC-1234",
"AX_2356",
"AbC234")
What is the appropriate way to grep for ONLY the entries that DON'T follow an "XXX-0000" pattern? In the example above I'd want to end up with only "XXX-2355" and "ABC-1234". There are tens of thousands of records.
I tried selecting by individual issue. For example,
id.error = rep(NA, length(id.data))
id.error[-grep("-", id.data)] = "hyphen"
This was obviously really inefficient, and I have no way of knowing every possible error. strsplit was useful up to a point, but only when I know where to split.
Thanks!
Upvotes: 1
Views: 597
Reputation: 123678
You seem to be looking for the invert argument of grep:
invert: logical. If TRUE return indices or values for elements that do not match.
> id.data = c("XXX-2355",
+ "XYz-03",
+ "XYU-3",
+ "ABC-1234",
+ "AX_2356",
+ "AbC234")
> grep("[A-Z]{3}-[0-9]{4}", id.data)
[1] 1 4
> grep("[A-Z]{3}-[0-9]{4}", id.data, value = TRUE)
[1] "XXX-2355" "ABC-1234"
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE)
[1] 2 3 5 6
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE, value = TRUE)
[1] "XYz-03" "XYU-3" "AX_2356" "AbC234"
Not sure whether you want strings that match the said pattern, or those that don't match. The above example lists both options.
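One caveat worth sketching: a pattern such as "[A-Z]{3}-[0-9]{4}" is not anchored, so it also matches longer strings that merely contain a valid ID. Assuming the intended format is exactly three uppercase letters, a hyphen and four digits, adding ^ and $ makes the test stricter, and grepl gives a logical vector that splits the data both ways in one pass. The names pattern, ok, valid.ids, invalid.ids and the "bad format" label are only illustrative.

```r
id.data <- c("XXX-2355", "XYz-03", "XYU-3", "ABC-1234", "AX_2356", "AbC234")

# Assumed format: exactly three uppercase letters, a hyphen, four digits.
# ^ and $ force the whole string to match, not just a part of it.
pattern <- "^[A-Z]{3}-[0-9]{4}$"

# One logical flag per element: TRUE where the ID is well formed
ok <- grepl(pattern, id.data)

valid.ids   <- id.data[ok]    # entries that follow the pattern
invalid.ids <- id.data[!ok]   # entries that do not

# Flag every malformed entry without enumerating each kind of error
id.error <- ifelse(ok, NA_character_, "bad format")

# The unanchored pattern accepts this string, the anchored one rejects it
unanchored.hit <- grepl("[A-Z]{3}-[0-9]{4}", "XXXX-23556")
anchored.hit   <- grepl(pattern, "XXXX-23556")
```

Because grepl is vectorised, this should scale fine to tens of thousands of records. If the locale makes [A-Z] behave unexpectedly, [[:upper:]] is a possible substitute.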
Upvotes: 4
Reputation: 14453
One way:
library(stringr)
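# [A-Z] rather than [A-z]: the latter also matches lowercase letters and some punctuation.
# ^ and $ make the whole string match the pattern.
id.data[str_detect(id.data, "^[A-Z]{3}-[0-9]{4}$")]
# [1] "XXX-2355" "ABC-1234"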
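For the opposite selection (the entries that do not follow the format), str_detect accepts negate = TRUE, assuming stringr 1.4.0 or later. A minimal sketch, with bad.ids as an illustrative name and an anchored pattern that assumes the format is exactly three uppercase letters, a hyphen and four digits:

```r
library(stringr)

id.data <- c("XXX-2355", "XYz-03", "XYU-3", "ABC-1234", "AX_2356", "AbC234")

# negate = TRUE returns TRUE for elements that do NOT match the pattern
bad.ids <- id.data[str_detect(id.data, "^[A-Z]{3}-[0-9]{4}$", negate = TRUE)]
```

On older stringr versions, wrapping the call in ! gives the same selection.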
Upvotes: 0