Nancy

Reputation: 4109

Use grep() to select character strings with "XXX-0000" syntax

Given a character vector:

    id.data = c("XXX-2355",
                "XYz-03",
                "XYU-3", 
                "ABC-1234",
                "AX_2356",
                "AbC234")

What is the appropriate way to grep for ONLY the entries that DON'T follow an "XXX-0000" pattern? In the example above I'd want to end up with only "XXX-2355" and "ABC-1234". There are tens of thousands of records.

I tried selecting by individual issue. For example,

    id.error = rep(NA, length(id.data))
    id.error[-grep("-", id.data)] = "hyphen"

This was obviously really inefficient, and I have no way of knowing every possible error. strsplit() was useful up to a point, but only when I know where to split.

Thanks!

Upvotes: 1

Views: 597

Answers (2)

devnull

Reputation: 123678

You seem to be looking for invert:

invert logical. If TRUE return indices or values for elements that do not match.

    > id.data = c("XXX-2355",
    +                 "XYz-03",
    +                 "XYU-3",
    +                 "ABC-1234",
    +                 "AX_2356",
    +                 "AbC234")
    > grep("[A-Z]{3}-[0-9]{4}", id.data)
    [1] 1 4
    > grep("[A-Z]{3}-[0-9]{4}", id.data, value = TRUE)
    [1] "XXX-2355" "ABC-1234"
    > grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE)
    [1] 2 3 5 6
    > grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE, value = TRUE)
    [1] "XYz-03"  "XYU-3"   "AX_2356" "AbC234"

Not sure whether you want strings that match the said pattern, or those that don't match. The above example lists both options.
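
One caveat with the pattern above: it is not anchored, so it also matches longer strings that merely contain an "XXX-0000" piece. Below is a minimal sketch of an anchored variant, assuming the whole string must be exactly three uppercase letters, a hyphen, and four digits. The entry "XABC-12345" and the names `ids`, `is_valid`, and `id_error` are made up for illustration. It uses grepl(), which returns a logical vector and so stays safe for indexing even when nothing matches (unlike negative indexing with an empty grep() result).

```r
# Data from the question, plus one hypothetical entry ("XABC-12345")
# added here only to show why anchoring matters.
ids <- c("XXX-2355", "XYz-03", "XYU-3", "ABC-1234", "AX_2356", "AbC234",
         "XABC-12345")

loose  <- "[A-Z]{3}-[0-9]{4}"     # matches anywhere inside the string
strict <- "^[A-Z]{3}-[0-9]{4}$"   # must describe the whole string

is_valid <- grepl(strict, ids)

# Flag vector in the spirit of id.error from the question:
# NA for well-formed IDs, a label for everything else.
id_error <- ifelse(is_valid, NA_character_, "bad format")

ids[is_valid]    # entries that follow the pattern exactly
ids[!is_valid]   # everything else
```

With the unanchored pattern, "XABC-12345" would count as a match because it contains "ABC-1234"; the anchored one rejects it.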

Upvotes: 4

johannes

Reputation: 14453

One way:

    library(stringr)
    id.data[str_detect(id.data, "[A-Z]{3}-[0-9]{4}")]
    # [1] "XXX-2355" "ABC-1234"
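
A note on character ranges, since they are easy to get wrong here: a range such as [A-z] covers more than letters, because lowercase letters and several punctuation characters (including the underscore) sit between "A" and "z" in ASCII order. The sketch below shows the difference in base R. The test strings are hypothetical, and perl = TRUE is an assumption chosen only so that ranges are interpreted by code point regardless of locale.

```r
# Hypothetical test strings: one with an underscore, one lowercase,
# and one well-formed ID.
candidates <- c("AB_-1234", "abc-1234", "ABC-1234")

# [A-z] spans "A" through "z" in ASCII order, which includes
# lowercase letters and the characters [ \ ] ^ _ `
wide   <- grepl("^[A-z]{3}-[0-9]{4}$", candidates, perl = TRUE)

# [A-Z] is limited to uppercase ASCII letters
narrow <- grepl("^[A-Z]{3}-[0-9]{4}$", candidates, perl = TRUE)

candidates[wide]     # all three strings pass the wide range
candidates[narrow]   # only the well-formed ID passes
```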

Upvotes: 0
