mfolkes
mfolkes

Reputation: 97

R select data frame rows using NA in search pattern

I wish to select row indexes from a source data frame given search criteria in a search pattern data frame. For simplicity and clarity, the colnames in the search pattern data frame must be a subset of those in the source data frame. But here is the kicker - both data frames can include NA's and I need to use those NA's in the search criteria so excluding those records beforehand will not fulfill my goal.

I'm including a working solution to my challenge but am interested to see if anyone has suggestions for improvements. I've done it using just base R, which is my preference, but if anyone has a quick that's already done in this package tip, I'd be grateful.

In my real situation my search pattern data frame can comprise 20k records and the data frame being searched is >500k records. Under this situation the run time can be >20 mins, which is unfortunate. thanks

selectRows <- function(df.pattern, df.x){
    output <- lapply(1:nrow(df.pattern), function(i){
        x <- df.pattern[i,]
        
        res <- lapply(colnames(x), function(col){
            if(is.na(x[,col])){
                which(is.na(df.x[,col]))
            }else{
                grep(x[,col], df.x[,col])
            }
        })
        
    index.select <- Reduce(intersect, res)
    index.select
    })#END lapply
    
    output
}#END selectRows



#df.x is the data frame to search
df.x <- expand.grid(a=1:10, b=letters, c=LETTERS, stringsAsFactors = F)
#replace some cells with NA
for(i in 1:3) df.x[i,i ] <- NA
head(df.x)

#for simplicity the search pattern df is taken from the original df, using rows with NA's and complete cases too 
df.pattern <- df.x[c(22,1:4,10),c(2,1,3)]
row.names(df.pattern) <- NULL
df.pattern

results <- selectRows(df.pattern, df.x)
unlist(results)


Upvotes: 2

Views: 52

Answers (1)

akrun
akrun

Reputation: 887223

We may use inner_join. Create a column of row sequence, then join with the 'df.pattern' and pull the sequence vector

library(dplyr)
df.x %>% 
    mutate(rn = row_number()) %>%
    inner_join(df.pattern, .) %>%
    pull(rn)
#[1] 22  1  2  3  4 10

Or paste the rows to a single string, create a logical vector with %in%, get the positions with which

which(do.call(paste, df.x) %in% do.call(paste, df.pattern[names(df.x)]))
#[1]  1  2  3  4 10 22

If the order needs to be based on the second dataset and the second data doesn't need to consider duplicate matches, use match

match(do.call(paste, df.pattern[names(df.x)]), do.call(paste, df.x))
[1] 22  1  2  3  4 10

Or a use a data.table join

library(data.table)
setDT(df.x)[, rn := .I][df.pattern, rn, on = names(df.pattern)]
#[1] 22  1  2  3  4 10

Upvotes: 2

Related Questions