user5813583

Reputation: 133

Filter by string format in R

I have an ID column that should always be formatted like ABCDE123: five letters followed by three digits, with no spaces or symbols.

I know for sure there are a number of rows that don't correctly follow this format. Is it possible to filter by the string format in R, so that I can identify those rows and review them?

Tidyverse is preferred, but any solution would be helpful!

Upvotes: 2

Views: 533

Answers (1)

akrun

Reputation: 887223

If the IDs are 5 upper case letters followed by 3 digits, use a regex that matches 5 upper case letters ([A-Z]{5}) from the start (^) of the string, followed by 3 digits ([0-9]{3}) at the end ($) of the string. Passing it to str_detect returns a logical vector, which filter uses to keep the rows that follow the format:

library(dplyr)
library(stringr)
# keep only the rows whose ID matches the expected format
df1 %>%
    filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$'))

To drop the matching rows instead, and so keep only the IDs that do not follow the format (the ones to review), specify negate = TRUE in str_detect:

df1 %>%
    filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$', negate = TRUE))

Or, as @BenBolker mentioned in the comments, [[:upper:]]{5} would be more generic than [A-Z]{5}, since it is not limited to the ASCII letters A to Z.
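For a quick check of how both filters behave, here is a minimal sketch on made-up data. The sample IDs are hypothetical and only the names df1 and ID come from the question. One thing to watch: str_detect returns NA for a missing ID, and filter drops NA rows, so the sketch adds is.na(ID) on the assumption that missing IDs should also be reviewed.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data: one valid ID and several malformed ones
df1 <- data.frame(
  ID = c("ABCDE123", "abcde123", "ABCD1234", "ABCDE 123", "ABCDE12", NA),
  stringsAsFactors = FALSE
)

# 5 upper case letters, then 3 digits, anchored to the whole string
pattern <- "^[[:upper:]]{5}[0-9]{3}$"

# Rows that follow the format
valid <- df1 %>%
  filter(str_detect(ID, pattern))

# Rows to review: IDs that do not match, plus missing IDs
# (without is.na(ID), the NA row would be silently dropped by filter)
invalid <- df1 %>%
  filter(is.na(ID) | str_detect(ID, pattern, negate = TRUE))

print(valid)
print(invalid)
```

If lower case letters should count as valid too, [[:alpha:]]{5} could replace [[:upper:]]{5}; the question does not say either way.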

Upvotes: 3
