TheGoat

Reputation: 2877

String matching with grepl() and with() in R

I want to drop any rows in my data frame where a specific column does not match a regular expression, i.e. the cell must begin with two uppercase letters followed by four digits. After that I do not care whether it is U09 or U21; only the first 6 characters matter.

I am using the following code but I am getting 0 rows returned and I am not sure why:

with(prachData, prachData[grepl("^[A-Z][A-Z][0-9]{4}$", WCEL.name), ])

When I type head(prachData$WCEL.name) I get the following details:

> head(prachData$WCEL.name)
[1] 0           0           CE0001U21B2 CE0001U21A3 CE0001U21C1 CE0001U21B1
13684 Levels: 0 1 11 12 13 2 21 22 23 3 31 32 33 CE0001U09A3 CE0001U09B3 CE0001U09C3 CE0001U21A1 CE0001U21A2 ... WX0114U09C3

And using class(prachData$WCEL.name) I get:

[1] "factor"

Can anyone guide me to my mistake?

Upvotes: 1

Views: 795

Answers (1)

akrun

Reputation: 887891

The problem seems to be the $ in the pattern. $ is a metacharacter that matches the end of the string, so the OP's pattern only matches values that are exactly six characters long. In the input shown, other characters follow the four digits (e.g. U21B2), so grepl returns FALSE for every element. Dropping the $ should fix it:

with(prachData, prachData[grepl("^[A-Z][A-Z][0-9]{4}", WCEL.name), ])

To show this with a reproducible example:

v1 <- factor(c(0, 0, 'CE0001U21B2', 'CE0001U21A3',
               'CE0001U21C1', 'CE0001U21B1'))
grepl("[A-Z]{2}[0-9]{4}$", v1)
#[1] FALSE FALSE FALSE FALSE FALSE FALSE

This returns all FALSE, because no element ends immediately after the four digits.

So, when we subset 'v1' based on the above index,

v1[grepl("[A-Z]{2}[0-9]{4}$", v1)]
#factor(0)
#Levels: 0 CE0001U21A3 CE0001U21B1 CE0001U21B2 CE0001U21C1

the result is a factor of length 0 (the unused levels are still listed),

whereas the same pattern without the $ matches the last four elements:

grepl("[A-Z]{2}[0-9]{4}", v1)
#[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
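
One caveat: the demonstration pattern above has no ^, so it matches two letters plus four digits anywhere in the string, while the question asks for the first 6 characters. Below is a minimal sketch of the anchored version applied to a data frame. The data frame prach_demo and its values are made up for illustration (they stand in for prachData), and the droplevels() call is an optional extra, not something the OP asked for.

```r
# Hypothetical stand-in for prachData; the values are invented for this sketch
prach_demo <- data.frame(
  WCEL.name = factor(c("0", "12", "CE0001U21B2", "WX0114U09C3", "XCE0001U21")),
  value = 1:5
)

# "^" anchors the match at the start of the string; there is no "$",
# so any characters may follow the first six
keep <- grepl("^[A-Z]{2}[0-9]{4}", prach_demo$WCEL.name)
# keep is FALSE FALSE TRUE TRUE FALSE

# Without "^" the last value would also match, because "CE0001" occurs inside it
unanchored <- grepl("[A-Z]{2}[0-9]{4}", prach_demo$WCEL.name)

# Subset the rows and drop factor levels that no longer occur
filtered <- droplevels(prach_demo[keep, ])
```

Here filtered keeps only the CE0001U21B2 and WX0114U09C3 rows, and "XCE0001U21" is rejected only by the anchored pattern.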

Upvotes: 4
