Reputation: 4883
I have a dataset with more than 2 million rows and several columns. Some columns hold hospital codes, which correspond to all the conditions each patient had during that hospitalisation. I need to produce summaries for each condition, so I'm trying to create a dataset that contains only the rows for a single condition of interest.
The codes have 5 digits, but sometimes I want to select codes that begin with three given digits (the remaining two digits don't matter). For instance, I want every row that has a code beginning with 401 in any of the columns that contain these codes. Small example:
id  dx_1   dx_2  dx_3  dx_n
1   401
2   2500   4011
3   18524
I would want ids 1 and 2. I've tried the following, but I get an error and it's slow. Any pointers or suggestions are most welcome. If anything is unclear I will try to give more information.
final_DB[apply(grep(paste("^", i, sep=""), final_DB[,10:29]), 1, any),]
Here i corresponds to the number I want, so in this case i <- 401, and columns 10 to 29 are all the columns where this code might be.
Upvotes: 2
Views: 108
Reputation: 887871
One option would be filter_at: select the columns of interest, then keep the rows where any of those variables has 401 as its first three characters (checked with substr).
library(dplyr)
df1 %>%
filter_at(vars(starts_with("dx")), any_vars(substr(., 1, 3) == '401'))
# id dx_1 dx_2 dx_3 dx_n
#1 1 401 NA NA NA
#2 2 2500 4011 NA NA
Or using base R: loop through the columns of interest (in this case, all the columns except the first), use grepl to check whether the pattern "^401" is present in each one. This returns a list of logical vectors, which we Reduce to a single logical vector with the | operator and then use to subset the rows of the data.
df1[Reduce(`|`, lapply(df1[-1], grepl, pattern = "^401")), ]
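The base R approach can also be written with the prefix held in a variable, as in the question. This is a minimal sketch on toy data shaped like the question's example; the names pattern, dx_cols, keep and result are illustrative, and selecting columns whose names start with "dx" is an assumption about how the real columns are named.

```r
# Toy data shaped like the question's example
df1 <- data.frame(id = 1:3,
                  dx_1 = c(401L, 2500L, 18524L),
                  dx_2 = c(NA, 4011L, NA),
                  dx_3 = NA, dx_n = NA)

i <- 401                              # prefix of interest
pattern <- paste0("^", i)             # "^401"
dx_cols <- grep("^dx", names(df1))    # positions of the diagnosis columns (assumed naming)

# One logical vector per column, combined element-wise with |
keep <- Reduce(`|`, lapply(df1[dx_cols], grepl, pattern = pattern))
result <- df1[keep, ]
result$id
```

Because grepl returns FALSE for NA values, rows with missing codes are simply not matched and no extra NA handling is needed here.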
Regarding the issue in the OP's post
final_DB[apply(grep(paste("^", i, sep=""), final_DB[,10:29]), 1, any),]
Here the grep
is applied on a data.frame instead of a vector
and grep
works on vector/matrices
. To correct it we loop through the rows (it would be inefficient though - just to correct the code)
i1 <- apply(final_DB[, 10:29], 1, function(x) any(grepl(paste("^", i, sep=""), x)))
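To see why the original call fails, it may help to compare grep and grepl on a small vector. This is only an illustrative sketch; x, idx, flag and res are made-up names, and the tryCatch is there just so the snippet runs to completion.

```r
x <- c("4011", "2500", "40100")

idx  <- grep("^401", x)    # integer positions of the matching elements
flag <- grepl("^401", x)   # one TRUE/FALSE per element

# apply() needs an object with dimensions; a plain vector has none,
# so the call fails before any() is ever reached.
res <- tryCatch(apply(idx, 1, any), error = function(e) "error")
```

The logical vector from grepl has the same length as its input, which is what makes it usable for row subsetting.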
Data:
df1 <- structure(list(id = 1:3, dx_1 = c(401L, 2500L, 18524L), dx_2 = c(NA,
    4011L, NA), dx_3 = c(NA, NA, NA), dx_n = c(NA, NA, NA)),
    class = "data.frame", row.names = c(NA, -3L))
Upvotes: 2
Reputation: 160932
I'll use mtcars to demonstrate one method (in base R). (BTW: it is not clear to me whether your data is character or numeric, but it doesn't matter: the grep* functions will happily convert to character to find things, as in grepl("^123", 122:124) ... though regex matching on floating-point values should be taken with a grain of salt, since the match runs against the character form of the number.)
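A small sketch of that caveat, assuming default R options: the match is done on whatever as.character produces, so a double that is rendered in scientific notation will not match its leading digits. The vectors codes_int and codes_dbl are made up for illustration.

```r
grepl("^123", 122:124)             # integers are converted digit for digit

codes_int <- c(40100L, 100000L)    # stored as integer
codes_dbl <- c(40100, 100000)      # same values stored as double

chr_int <- as.character(codes_int) # "40100" "100000"
chr_dbl <- as.character(codes_dbl) # "40100" "1e+05"

hit_int <- grepl("^100", codes_int)  # FALSE TRUE
hit_dbl <- grepl("^100", codes_dbl)  # FALSE FALSE, because of "1e+05"

# One way to avoid the surprise: convert explicitly without scientific notation
safe <- format(codes_dbl, scientific = FALSE, trim = TRUE)
hit_safe <- grepl("^100", safe)      # FALSE TRUE
```

For five-digit codes this is unlikely to bite, but storing the codes as character or integer removes the ambiguity.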
Let's say we want every row where something starts with 20 through 25:
mt <- mtcars[1:10, 1:7]
sapply(mt, grepl, pattern = "^2[0-5]")
# mpg cyl disp hp drat wt qsec
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE TRUE FALSE FALSE FALSE TRUE
# [7,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [8,] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
# [9,] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
# [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
To highlight what those are:
mt
# mpg cyl disp hp drat wt qsec
# Mazda RX4 *21.0* 6 160.0 110 3.90 2.620 16.46
# Mazda RX4 Wag *21.0* 6 160.0 110 3.90 2.875 17.02
# Datsun 710 *22.8* 4 108.0 93 3.85 2.320 18.61
# Hornet 4 Drive *21.4* 6 *258.0* 110 3.08 3.215 19.44
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02
# Valiant 18.1 6 *225.0* 105 2.76 3.460 *20.22*
# Duster 360 14.3 8 360.0 *245* 3.21 3.570 15.84
# Merc 240D *24.4* 4 146.7 62 3.69 3.190 *20.00*
# Merc 230 *22.8* 4 140.8 95 3.92 3.150 *22.90*
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30
Now to use this:
mt[ rowSums(sapply(mt, grepl, pattern = "^2[0-5]")) > 0, ]
# mpg cyl disp hp drat wt qsec
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90
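The rowSums(...) > 0 step is just one way to ask "is any cell in this row TRUE?". The sketch below, using the same mt subset, compares it with two alternatives; the names m, by_rowsums, by_reduce and by_apply are illustrative.

```r
mt <- mtcars[1:10, 1:7]
m  <- sapply(mt, grepl, pattern = "^2[0-5]")   # logical matrix, one column per variable

by_rowsums <- rowSums(m) > 0                   # TRUE counts as 1, so > 0 means "any TRUE"
by_reduce  <- Reduce(`|`, as.data.frame(m))    # column-wise OR, as in the other answer
by_apply   <- apply(m, 1, any)                 # row-wise loop, usually the slowest

same <- identical(unname(by_rowsums), unname(by_reduce)) &&
  identical(unname(by_rowsums), unname(by_apply))
```

All three give the same row selection here; rowSums and Reduce avoid the row-by-row loop, which is likely to matter on a table with millions of rows.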
If you only need to check a specific set of columns, add the column selection to mt within the sapply:
mt[ rowSums(sapply(mt[,c(1,4,7)], grepl, pattern = "^2[0-5]")) > 0, ]
# mpg cyl disp hp drat wt qsec
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90
Upvotes: 2