Reputation: 4635
I am trying to filter some strings in my data. For example, I want to filter out 'AxxBy' strings but keep 'AxxByy' strings, where each x and y stands for a digit.
Here is what I tried:
data <- data.frame(pair=paste(paste('A',c(seq(1:4),10,11),sep=''),paste('B',c(2,3,4,22,33,44),sep=''),sep=''))
pair
1 A1B2
2 A2B3
3 A3B4
4 A4B22
5 A10B33
6 A11B44
I want to remove the pairs starting with A1, but not those starting with A10 or A11. Likewise, I want to remove the pairs ending in B2 but keep those ending in B22, and so on.
x <- c(paste('A',1,sep=''), paste('B',2,sep='')) # filtering conditions
library(dplyr)
df <- data%>%
filter(!grepl(paste(x,collapse='|'),pair))
pair
1 A2B3
2 A3B4
The post "Filtering observations in dplyr in combination with grepl" shows that anchors such as "^x|xx$" can be added to the regular expression, but I haven't found any post covering the case where the filtering conditions are defined outside of the pipe.
Expected output
pair
1 A2B3
2 A3B4
3 A4B22
4 A10B33
5 A11B44
The rule of thumb is this: if a condition in x is 'A' followed by digits, it should be matched up to the following 'B' (i.e. 'AxxB'), so that only pairs with exactly those digits after 'A' are removed. If a condition is 'B' followed by one digit ('By'), pairs matching 'By$' should be removed but not 'Byy' pairs. This covers both 'AxBy$' and 'AxxBy$'. I still cannot generalize @alistaire's solution!
Upvotes: 0
Views: 1867
Reputation: 42544
The OP has requested to filter out 'AxxBy' strings but wants to keep 'AxxByy' strings (where 'x' and 'y' denote digits).
Often it is easier to specify what to keep than what to remove. To keep strings which obey the pattern 'AxxByy' the regular expression
"^A\\d{2}B\\d{2}$"
can be used, where ^ denotes the beginning of the string, \\d{2} a sequence of exactly two digits, and $ the end of the string. A and B stand for themselves.
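To illustrate why the anchors matter, here is a minimal sketch that compares the anchored pattern with an unanchored variant. The test strings and the names s, anchored and unanchored are made up for this illustration and are not part of the OP's data.

```r
# hypothetical test strings, chosen for illustration only
s <- c("A10B33", "A4B22", "AA12B34", "A12B345")

# anchored: the whole string must be A, two digits, B, two digits
anchored <- grepl("^A\\d{2}B\\d{2}$", s)

# unanchored: it is enough that the pattern occurs somewhere in the string
unanchored <- grepl("A\\d{2}B\\d{2}", s)

# side-by-side comparison
data.frame(s, anchored, unanchored)
```

Without ^ and $, strings such as "AA12B34" and "A12B345" would also be matched, because they contain a matching substring.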
With this regular expression, dplyr and grepl() can be used to filter the input data frame DF:
library(dplyr)
# which rows are kept?
kept <- DF %>%
  filter(grepl("^A\\d{2}B\\d{2}$", pair))
kept
# pair
#1 A10B33
#2 A11B44
# which rows are removed?
removed <- DF %>%
  filter(!grepl("^A\\d{2}B\\d{2}$", pair))
removed
# pair
#1 A1B2
#2 A2B3
#3 A3B4
#4 A4B22
#5 AB
#6 A
#7 B
#8 A1
#9 A12
#10 B1
#11 B12
#12 AA12B34
#13 A12BB34
Note that I've added some edge cases for demonstration.
BTW: dplyr is not required if only the vector pair needs to be filtered. So, in base R the alternative expressions
pair[grepl("^A\\d{2}B\\d{2}$", pair)]
grep("^A\\d{2}B\\d{2}$", pair, value = TRUE)
both return the strings to keep:
[1] "A10B33" "A11B44"
while
pair[!grepl("^A\\d{2}B\\d{2}$", pair)]
returns the removed strings:
[1] "A1B2" "A2B3" "A3B4" "A4B22" "AB" "A" "B" "A1"
[9] "A12" "B1" "B12" "AA12B34" "A12BB34"
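As a possible alternative to negating grepl(), grep() has an invert argument that returns the non-matching elements directly. A minimal sketch, where the short vector p is a made-up subset used only for illustration:

```r
# hypothetical subset of the test strings, for illustration only
p <- c("A1B2", "A4B22", "A10B33", "A11B44")

# invert = TRUE returns the elements that do NOT match the pattern
not_matching <- grep("^A\\d{2}B\\d{2}$", p, value = TRUE, invert = TRUE)
not_matching
```

Here only "A1B2" and "A4B22" fail the 'AxxByy' pattern, so these are the strings returned.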
The data used above are as given by the OP, but with some edge cases appended:
# create vector of test patterns using paste0() instead of paste(..., sep = "")
pair <- paste0("A", c(1:4, 10, 11), "B", c(2, 3, 4, 22, 33, 44))
# alternatively use sprintf()
pair <- sprintf("A%iB%i", c(1:4, 10, 11), c(2, 3, 4, 22, 33, 44))
# add some edge cases
pair <- append(pair, c("AB", "A", "B", "A1", "A12", "B1", "B12", "AA12B34", "A12BB34"))
# create data frame
DF <- data.frame(pair)
DF
# pair
#1 A1B2
#2 A2B3
#3 A3B4
#4 A4B22
#5 A10B33
#6 A11B44
#7 AB
#8 A
#9 B
#10 A1
#11 A12
#12 B1
#13 B12
#14 AA12B34
#15 A12BB34
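The OP also asked how to use filtering conditions that are defined outside of the pipe. One possible approach, sketched here under the assumption that each condition is a letter followed by digits and contains no regex metacharacters, is to append a negative lookahead to every condition so that no further digit may follow it. The variable name pattern is introduced only for this illustration, and lookaheads require perl = TRUE.

```r
library(dplyr)

# the OP's original six pairs
data <- data.frame(pair = paste0("A", c(1:4, 10, 11), "B", c(2, 3, 4, 22, 33, 44)))

# filtering conditions defined outside of the pipe
x <- c("A1", "B2")

# assumption: conditions are plain letter-plus-digits strings;
# "(?!\\d)" means "not followed by another digit", so "A1" matches
# in "A1B2" but not in "A10B33", and "B2" does not match in "A4B22"
pattern <- paste0(x, "(?!\\d)", collapse = "|")

# lookaheads are a PCRE feature, hence perl = TRUE
res <- data %>%
  filter(!grepl(pattern, pair, perl = TRUE))
res
```

With the OP's data this removes only A1B2 and keeps A2B3, A3B4, A4B22, A10B33 and A11B44. Conditions containing special regex characters would need escaping first.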
Upvotes: 2