Quickly finding rows that contain a value with specific substring requirements

Question

I have a data frame that is 40 columns wide and 3 million rows long. Each cell either contains a value or is missing, and each row has at least a few cells filled. I am interested in finding those ROWS that contain any value starting with "M" and having a '3' as the sixth character. My biggest issue is how to handle this given the size of the data frame...

n <- 40 * 300000 # 300k rows already takes long, let alone 3M!
data <- data.frame(matrix(paste0(sample(LETTERS, n, replace = TRUE), sample(10000:99999, n, replace = TRUE)), ncol = 40))

The following code will find all the codes starting with M and having a '3' at the end (the sixth character, since every code is six characters long). However, it is slow, and I also need a vector returned that shows which rows contain any of the codes of interest (1) and which do not (0).

data[sapply(data, substring, 1, 1) == "M" & is.na(data)==F & sapply(data, substring, 6, 6) == "3" ]

My main issue is that I need a speedy solution!

alistaire · Accepted Answer

startsWith and endsWith are faster than substring, and each sapply is a loop, which will take a while. (If you need sapply but need speed, check out vapply.)
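
As a rough sketch of the vapply suggestion, the per-column test can declare its return type and length up front. The data frame toy and its columns are made up for illustration and are not the asker's data:

```r
# Toy data for illustration only; not the asker's data.
toy <- data.frame(
  a = c("M12343", "Q12345", NA),
  b = c("X99993", "M00003", "M12345"),
  stringsAsFactors = FALSE
)

# One logical vector per column. vapply checks that each result is a
# logical vector of length nrow(toy) and binds the results into a matrix
# with one column per data frame column.
hits <- vapply(
  toy,
  function(col) startsWith(col, "M") & endsWith(col, "3"),
  logical(nrow(toy))
)

hits
```

Unlike sapply, vapply stops with an error if a column produces something other than the declared shape, which tends to make it both safer and a little faster.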

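Here, we can use apply to evaluate each row for any matching elements, where df is the original data.frame (data is a bad name because it is also the name of a base R function, which can cause conflicts). Because every code in the example is six characters long, endsWith(x, '3') is equivalent to testing the sixth character. Rows with NAs and matching values will return TRUE; rows with NAs but no matching values will return NA. If you'd rather have FALSE, wrap x below in na.omit.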

# takes 7 seconds on my machine
row_indices <- apply(df, 1, function(x){any(startsWith(x, 'M') & endsWith(x, '3'))})

head(row_indices)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE

# effectively instantaneous
df_subset <- df[row_indices, ]

df_subset[1, ]
##       X1     X2     X3     X4     X5     X6     X7     X8     X9    X10    X11    X12    X13
## 5 Q69164 D42439 X17664 A81746 Z82859 B10892 I39329 O29425 D83560 W14944 M64225 K47156 X26742
##      X14    X15    X16    X17    X18    X19    X20    X21    X22    X23    X24    X25    X26
## 5 I51962 Q57501 Q29214 W20713 U84761 S35597 D93796 F15041 V51597 O93538 O55946 F67256 D85638
##      X27    X28    X29    X30    X31    X32    X33    X34    X35    X36    X37    X38    X39
## 5 N82913 Q55887 V10815 M59412 L17626 E83108 E40069 I21677 U99952 X24291 O55932 M79693 C48984
##      X40
## 5 O63422
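
The NA behaviour described above can be checked on a single toy row. This is a minimal sketch: the vector x is invented for illustration, and na.rm = TRUE is shown as an alternative to the na.omit approach mentioned in the answer:

```r
# A toy row with one missing cell and no matching code.
x <- c("Q12345", NA, "X99993")
match_x <- startsWith(x, "M") & endsWith(x, "3")

# The missing cell might have matched, so the result is unknown (NA).
res_default <- any(match_x)

# Ignoring missing cells turns the unknown into FALSE.
res_na_rm <- any(match_x, na.rm = TRUE)

# Dropping the NAs before testing gives the same answer.
x_clean <- na.omit(x)
res_omit <- any(startsWith(x_clean, "M") & endsWith(x_clean, "3"))

c(default = res_default, na_rm = res_na_rm, omit = res_omit)
```

Either variant can be dropped into the function passed to apply if a strictly TRUE/FALSE result per row is needed.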

row_indices is a logical vector with TRUE for each row that satisfies the conditions, and FALSE otherwise. If you want it as a vector of 1s and 0s, coerce to integer:

row_indices_integer <- as.integer(row_indices)

head(row_indices_integer)
## [1] 0 0 0 0 1 0

Bonus: If you want the matching values themselves rather than the rows, a faster way is to convert to a matrix, which you can index as a giant vector. Coercing and subsetting together take about 3 seconds on my machine.

df_m <- as.matrix(df)    # make sure you have enough memory

matches <- df_m[startsWith(df_m, 'M') & endsWith(df_m, '3')]

head(matches)
## [1] "M67343" "M73753" "M61813" "M67903" "M25393" "M64273"
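
The same matrix idea can probably also produce the per-row 1/0 vector without calling apply. The following is a sketch under assumptions, not part of the original answer: toy is made-up data, substr(…, 6, 6) is swapped in for endsWith so the test is strictly "sixth character" even when codes vary in length, and missing cells are treated as non-matches:

```r
# Toy data for illustration only.
toy <- data.frame(
  a = c("M12343", "Q12345", NA),
  b = c("X99993", "M00003", "M12345"),
  stringsAsFactors = FALSE
)
toy_m <- as.matrix(toy)

# Test every cell at once, then restore the matrix shape explicitly
# in case the string functions return a plain vector.
hit <- matrix(
  startsWith(toy_m, "M") & substr(toy_m, 6, 6) == "3",
  nrow = nrow(toy_m)
)

# A row is flagged if at least one of its cells matched;
# na.rm = TRUE makes missing cells count as non-matches.
row_flag <- as.integer(rowSums(hit, na.rm = TRUE) > 0)
row_flag
```

On the full data set this needs the character matrix plus a logical matrix of the same shape in memory, so the memory caveat above applies here as well.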
