Filtering out some strings but keeping others with grepl

Question

I am trying to filter some strings in my data. For example, I want to filter out 'AxxBy' strings, but I want to keep strings like 'AxxByy'. Here x and y stand for digits.

Here is what I tried,

data <- data.frame(pair=paste(paste('A',c(seq(1:4),10,11),sep=''),paste('B',c(2,3,4,22,33,44),sep=''),sep='')) 
    pair
1   A1B2
2   A2B3
3   A3B4
4  A4B22
5 A10B33
6 A11B44

I want to remove the pairs that start with A1, but not those that start with A10 or A11. Likewise, I want to remove B2 but keep B22, and so on.

x <- c(paste('A',1,sep=''), paste('B',2,sep='')) # filtering conditions

library(dplyr)
df <- data%>%
  filter(!grepl(paste(x,collapse='|'),pair))

 pair
1 A2B3
2 A3B4

The post "Filtering observations in dplyr in combination with grepl" shows that anchors such as "^x|xx$" can be written directly into the regular expression, but I haven't seen any post that covers filtering conditions defined outside of the pipe.

Expected output

The rule of thumb is: if a condition in the x input is 'A' followed by digits, treat it as 'AxxB' (the digits followed directly by 'B') and remove every string that matches. If a condition is 'B' followed by a single digit ('By'), remove strings matching 'By$' but not 'Byy'. This covers both 'AxBy$' and 'AxxBy$', and that's all. I still cannot generalize @alistaire's solution!

Uwe · Accepted Answer

The OP has requested to filter out 'AxxBy' strings but wants to keep 'AxxByy' strings (where 'x' and 'y' denote digits).

Often it is easier to specify what to keep than what to remove. To keep strings which obey the pattern 'AxxByy' the regular expression

"^A\\d{2}B\\d{2}$"

can be used, where ^ denotes the beginning of the string, \d{2} a sequence of exactly two digits (written as \\d{2} inside an R string literal, because the backslash must be escaped), and $ the end of the string. A and B stand for themselves.
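To illustrate why the anchors matter, here is a minimal sketch comparing the anchored pattern with an unanchored variant. The test strings in s are hypothetical and chosen only for this illustration; they are not the OP's data.

```r
# Hypothetical test strings, not taken from the question's data
s <- c("A10B33", "A4B22", "AA12B34", "A10B333")

# Fully anchored: the whole string must be A, two digits, B, two digits
anchored <- grepl("^A\\d{2}B\\d{2}$", s)

# No anchors: a match anywhere inside the string is enough
unanchored <- grepl("A\\d{2}B\\d{2}", s)

# Side-by-side comparison of both results
data.frame(s, anchored, unanchored)
```

Without ^ and $, strings that merely contain a matching part, such as "AA12B34" or "A10B333", would be accepted as well.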

With this regular expression, dplyr and grepl() can be used to filter the input data frame DF:

library(dplyr)
# which rows are kept?
kept <- DF %>%
  filter(grepl("^A\\d{2}B\\d{2}$", pair))
kept
#    pair
#1 A10B33
#2 A11B44

# which rows are removed?
removed <- DF %>%
  filter(!grepl("^A\\d{2}B\\d{2}$", pair))
removed
#      pair
#1     A1B2
#2     A2B3
#3     A3B4
#4    A4B22
#5       AB
#6        A
#7        B
#8       A1
#9      A12
#10      B1
#11     B12
#12 AA12B34
#13 A12BB34

Note that I've added some edge cases for demonstration.

BTW: dplyr is not required if only the vector pair needs to be filtered. So, in base R the alternative expressions

pair[grepl("^A\\d{2}B\\d{2}$", pair)]
grep("^A\\d{2}B\\d{2}$", pair, value = TRUE)

both return the strings to keep:

[1] "A10B33" "A11B44"

while

pair[!grepl("^A\\d{2}B\\d{2}$", pair)]

returns the removed strings:

 [1] "A1B2"    "A2B3"    "A3B4"    "A4B22"   "AB"      "A"       "B"       "A1"     
 [9] "A12"     "B1"      "B12"     "AA12B34" "A12BB34"
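The question also asks about filtering conditions that are defined outside of the pipe (the vector x). The following is only a sketch of one possible way to do that, and it is a different technique from the keep-pattern above. It assumes the intent is that a condition such as A1 or B2 should only count as a match when it is not followed by another digit, and that the conditions contain no regex metacharacters. The name pattern is introduced here for illustration.

```r
# The OP's original six strings and the externally defined conditions
pair <- c("A1B2", "A2B3", "A3B4", "A4B22", "A10B33", "A11B44")
x <- c("A1", "B2")

# Build one pattern from x; (?![0-9]) is a negative lookahead that
# rejects a condition which is directly followed by another digit
pattern <- paste0("(", paste(x, collapse = "|"), ")(?![0-9])")

# Lookaheads require perl = TRUE
kept <- pair[!grepl(pattern, pair, perl = TRUE)]
kept
# [1] "A2B3"   "A3B4"   "A4B22"  "A10B33" "A11B44"
```

Inside a dplyr pipe the same pattern could be passed as filter(!grepl(pattern, pair, perl = TRUE)). Under these assumptions only "A1B2" is dropped, while "A4B22", "A10B33" and "A11B44" survive because B2 and A1 are followed by a digit there.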

Data

As given by the OP but with some edge cases appended:

# create vector of test patterns using paste0() instead of paste(..., sep = "")
pair <- paste0("A", c(1:4, 10, 11), "B", c(2, 3, 4, 22, 33, 44))
# alternatively use sprintf()
pair <- sprintf("A%iB%i", c(1:4, 10, 11), c(2, 3, 4, 22, 33, 44))
# add some edge cases
pair <- append(pair, c("AB", "A", "B", "A1", "A12", "B1", "B12", "AA12B34", "A12BB34"))
# create data frame
DF <- data.frame(pair)
DF
#      pair
#1     A1B2
#2     A2B3
#3     A3B4
#4    A4B22
#5   A10B33
#6   A11B44
#7       AB
#8        A
#9        B
#10      A1
#11     A12
#12      B1
#13     B12
#14 AA12B34
#15 A12BB34
