BINewbies

Reputation: 49

flexible patterns for a factor variable in order to subset a dataframe

I have a dataframe called mydf, simplified as below:

mydf

var1                          var2
abc_color1_location1_number1  1000
xyz_color1_location1_number1  100
asd_color2_location2_number1  900
qwe_color1_location1_number2  200
sdf_color2_location1_number2  1100
qwerrrr_ahjkkk_asdfgggg       234  
sdf_color1_location2_number1  3577
abc_color1_location3_number1  86544

I want to subset the dataset flexibly based on var1. For example:

pattern <- c("abc", "color1", "number1")
newmydf <- mydf[grep(paste("_",paste(pattern,collapse="_|_"),"_",sep=""),mydf$var1,ignore.case=T),]

My expected result:

newmydf
var1                          var2
abc_color1_location1_number1  1000

However, the resulting data frame was subset using only the patterns "abc" and "color1", whereas I want all of the patterns to be considered. Can anyone please help me with this?

Many thanks in advance!

With kind regards,

Upvotes: 1

Views: 50

Answers (3)

CPak

Reputation: 13591

An alternative approach is to split var1 on "_" with strsplit and keep the rows where all(pattern %in% tokens) is TRUE.

# as.character() guards against var1 being a factor, which strsplit() rejects
keep <- sapply(strsplit(as.character(df$var1), "_"), function(x) all(pattern %in% x))
df[keep, ]

Output

                          var1  var2
1 abc_color1_location1_number1  1000
8 abc_color1_location3_number1 86544

Data

df <- structure(list(var1 = c("abc_color1_location1_number1", "xyz_color1_location1_number1", 
"asd_color2_location2_number1", "qwe_color1_location1_number2", 
"sdf_color2_location1_number2", "qwerrrr_ahjkkk_asdfgggg", "sdf_color1_location2_number1", 
"abc_color1_location3_number1"), var2 = c(1000L, 100L, 900L, 
200L, 1100L, 234L, 3577L, 86544L)), .Names = c("var1", "var2"
), class = "data.frame", row.names = c(NA, -8L))

pattern <- c("abc", "color1", "number1")
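One point worth noting about this approach: because %in% compares whole tokens, it is stricter than a substring search. The following is a minimal sketch of that difference in base R. The vector x is hypothetical data made up for illustration; it is not part of the question.

```r
# Hypothetical data: "color10" contains "color1" as a substring, but not as a token
x <- c("abc_color1_number1", "abc_color10_number1")
pattern <- c("abc", "color1", "number1")

# Whole-token matching: split on "_" and require every pattern among the tokens
token_match <- vapply(
  strsplit(x, "_", fixed = TRUE),
  function(tokens) all(pattern %in% tokens),
  logical(1)
)

# Substring matching: every pattern must occur somewhere in the string
substring_match <- Reduce(`&`, lapply(pattern, grepl, x = x, fixed = TRUE))

token_match      # TRUE FALSE
substring_match  # TRUE TRUE
```

Which behaviour is preferable depends on the data; if values such as "color10" can occur, token matching is probably the safer choice.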

Upvotes: 0

www

Reputation: 39174

A solution using tidyverse and stringr. mydf2 is the final output.

find_match is a user-defined function that returns a logical vector (TRUE or FALSE for each row) indicating whether all the words in pattern were found.

Applying find_match inside filter keeps only the rows for which it returns TRUE.

library(tidyverse)
library(stringr)

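find_match <- function(Col, pattern){
  # One logical vector per word: does Col contain that word?
  m <- map(pattern, str_detect, string = Col)
  names(m) <- paste("Word", pattern)
  # as_data_frame() is deprecated; as_tibble() is its replacement
  m2 <- as_tibble(m)
  # Keep the rows in which every word was found
  results <- rowSums(m2) == length(pattern)
  return(results)
}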

mydf2 <- mydf %>% filter(find_match(var1, pattern))
mydf2
                          var1  var2
1 abc_color1_location1_number1  1000
2 abc_color1_location3_number1 86544

Data Preparation

# Create mydf
mydf <- read.table(text = "var1                          var2
abc_color1_location1_number1  1000
                   xyz_color1_location1_number1  100
                   asd_color2_location2_number1  900
                   qwe_color1_location1_number2  200
                   sdf_color2_location1_number2  1100
                   qwerrrr_ahjkkk_asdfgggg       234  
                   sdf_color1_location2_number1  3577
                   abc_color1_location3_number1  86544",
                   header = TRUE, stringsAsFactors = FALSE)

# Define the pattern
pattern <- c("abc", "color1", "number1")

Upvotes: 0

LyzandeR

Reputation: 37889

If you want all the elements of pattern to be considered, then something like this might help:

pattern <- c("abc", "color1", "number1")
alltrue <- rowSums(sapply(pattern, function(x) grepl(pattern = x, mydf$var1))) == length(pattern)

mydf[alltrue, ]
#                          var1  var2
#1 abc_color1_location1_number1  1000
#8 abc_color1_location3_number1 86544

Essentially, sapply runs grepl once for each element of pattern, giving a logical matrix with one column per element; rowSums then keeps only the rows where every grepl is TRUE.
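If a single regular expression is preferred, the same "all elements must match" logic can be sketched with one lookahead per element. This is only an illustration under a few assumptions: the elements of pattern contain no regex metacharacters, var1 is rebuilt here as a character column, and the names any_regex, all_regex, any_rows and all_rows are made up for this example. It also shows why joining with "|" behaves as OR rather than AND.

```r
# Same data as in the question, with var1 as character
mydf <- data.frame(
  var1 = c("abc_color1_location1_number1", "xyz_color1_location1_number1",
           "asd_color2_location2_number1", "qwe_color1_location1_number2",
           "sdf_color2_location1_number2", "qwerrrr_ahjkkk_asdfgggg",
           "sdf_color1_location2_number1", "abc_color1_location3_number1"),
  var2 = c(1000L, 100L, 900L, 200L, 1100L, 234L, 3577L, 86544L),
  stringsAsFactors = FALSE
)
pattern <- c("abc", "color1", "number1")

# Alternation ("|") means OR: a row matches if ANY element is present
any_regex <- paste(pattern, collapse = "|")
# "abc|color1|number1"

# One lookahead per element means AND: a row matches only if ALL are present
all_regex <- paste0("^", paste0("(?=.*", pattern, ")", collapse = ""))
# "^(?=.*abc)(?=.*color1)(?=.*number1)"

any_rows <- grepl(any_regex, mydf$var1)
all_rows <- grepl(all_regex, mydf$var1, perl = TRUE)  # lookaheads need perl = TRUE

newmydf <- mydf[all_rows, ]
which(any_rows)  # 1 2 3 4 7 8
which(all_rows)  # 1 8
```

The rowSums version above is arguably easier to read and debug; the lookahead form is mainly useful where only one regex can be supplied.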

Upvotes: 2
