select rows that match condition in several columns

Question

I have a dataset with more than 2 million rows and several columns. Some columns hold hospital codes, which record all the conditions each patient had during that hospitalisation. I need to produce summaries for each condition, so I'm trying to create a dataset that contains only the rows for a single condition of interest.

The codes have 5 digits, but sometimes I want to select codes that begin with three particular digits (the remaining two digits don't matter). For instance, I want every row that has a code beginning with 401 in any of the columns that contain these codes. Small example:

id dx_1 dx_2 dx_3 dx_n
1  401  
2  2500 4011
3  18524

I would want ids 1 and 2. I've tried the following, but I get an error and it's slow. Any pointers or suggestions are most welcome. If anything is unclear, I will try to give more information.

final_DB[apply(grep(paste("^", i, sep=""), final_DB[,10:29]), 1, any),]

Here i is the code prefix I want, so in this case i <- 401, and columns 10 to 29 are all the columns where this code might appear.

akrun · Accepted Answer

One option is filter_at: select the columns of interest, then keep the rows where any of those variables starts with 401, checked here by comparing the first three characters from substr.

library(dplyr)
df1 %>%
    filter_at(vars(starts_with("dx")), any_vars(substr(., 1, 3) == '401'))
#    id dx_1 dx_2 dx_3 dx_n
#1  1  401   NA   NA   NA
#2  2 2500 4011   NA   NA
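In dplyr 1.0.4 and later, filter_at() and any_vars() are superseded by filter() combined with if_any(). The following is a minimal sketch of the same filter in that style, assuming a recent dplyr is installed. The names i and pattern are illustrative and not part of the original answer.

```r
library(dplyr)

# Same toy data as in the "data" section of this answer
df1 <- data.frame(id = 1:3,
                  dx_1 = c(401L, 2500L, 18524L),
                  dx_2 = c(NA, 4011L, NA),
                  dx_3 = NA,
                  dx_n = NA)

i <- 401                      # prefix of interest (illustrative name)
pattern <- paste0("^", i)     # anchored regular expression: "^401"

# Keep a row if any dx column starts with the prefix.
# grepl() returns FALSE for NA, so missing codes never match.
result <- df1 %>%
  filter(if_any(starts_with("dx"), ~ grepl(pattern, .x)))

result$id   # ids of the rows that were kept
```

Using grepl with an anchored pattern instead of substr means the prefix length does not have to be hard-coded, so the same call should work for prefixes of any length.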

Or, using base R, loop through the columns of interest (in this case, all the columns except the first) and use grepl to check whether the pattern "^401" matches. This returns a list of logical vectors, one per column, which we Reduce to a single logical vector with | and then use to subset the rows of the data.

df1[Reduce(`|`, lapply(df1[-1], grepl, pattern = "^401")), ]
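To make the intermediate steps of the one-liner above visible, here is a sketch that splits it up and builds the pattern from a variable, as in the question. The names i, pattern, per_column and keep are illustrative.

```r
# Same toy data as in the "data" section of this answer
df1 <- data.frame(id = 1:3,
                  dx_1 = c(401L, 2500L, 18524L),
                  dx_2 = c(NA, 4011L, NA),
                  dx_3 = NA,
                  dx_n = NA)

i <- 401
pattern <- paste0("^", i)     # "^401"

# Step 1: one logical vector per dx column (TRUE where the code starts with 401)
per_column <- lapply(df1[-1], grepl, pattern = pattern)

# Step 2: combine the vectors element-wise with OR, giving one flag per row
keep <- Reduce(`|`, per_column)

# Step 3: subset the rows
result <- df1[keep, ]
```

For the data in the question, df1[-1] would presumably be replaced by final_DB[, 10:29]. Because grepl works on one whole column at a time, this avoids looping over 2 million rows.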

Regarding the issue in the OP's post

final_DB[apply(grep(paste("^", i, sep=""), final_DB[,10:29]), 1, any),]

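Here grep is applied to a data.frame rather than a vector, and it returns the positions of the matches, not a logical matrix that apply could process row by row. To correct it, we can loop through the rows with apply and call grepl on each row (this is inefficient on a large dataset; it is shown only to correct the code).

# TRUE for each row where any of columns 10 to 29 starts with the prefix i
i1 <- apply(final_DB[, 10:29], 1, function(x) any(grepl(paste("^", i, sep=""), x)))
# use the logical vector to subset the rows
final_DB[i1, ]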
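The difference between grep and grepl that causes the error can be seen on a small vector. The sketch below also runs the corrected row-wise code on the toy df1, since final_DB itself is not available. The name codes is illustrative.

```r
codes <- c("401", "2500", "4011")

positions <- grep("^401", codes)   # integer positions of the matches: 1 3
flags <- grepl("^401", codes)      # one logical per element: TRUE FALSE TRUE

# Same toy data as in the "data" section of this answer
df1 <- data.frame(id = 1:3,
                  dx_1 = c(401L, 2500L, 18524L),
                  dx_2 = c(NA, 4011L, NA),
                  dx_3 = NA,
                  dx_n = NA)

i <- 401

# Row-wise version of the corrected code: apply() hands each row to the
# function as a vector, and grepl() flags the codes that start with i
i1 <- apply(df1[-1], 1, function(x) any(grepl(paste0("^", i), x)))

result <- df1[i1, ]
```

Because grep returns a plain vector of positions, it has no dimensions, so apply(..., 1, any) has no rows to iterate over. grepl inside the row function gives one logical value per code instead.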

data

df1 <- structure(list(id = 1:3, dx_1 = c(401L, 2500L, 18524L), dx_2 = c(NA, 
 4011L, NA), dx_3 = c(NA, NA, NA), dx_n = c(NA, NA, NA)), 
 class = "data.frame", row.names = c(NA, -3L))
