Reputation: 958
I have a data frame that is 40 columns wide and 3 million long. Each cell can contain a value, or is missing. Each row has at least a few cells filled. I am interested in finding those ROWS that contain any value starting with "M" and having a '3' as the sixth character. My biggest issue is how to handle this given the size of the data frame...
n=40*300000 # 300k already takes long, let alone 3M!
data <- data.frame(matrix(paste0(sample(LETTERS, n, replace=T), sample(c(10000:99999), n, replace=T)), ncol=40))
The following code will find all the codes starting with M and having a '3' at the end, however, it is slow... and then I also need to have a vector returned that shows me which row contains any of the codes of interest (1) and which not (0).
data[sapply(data, substring, 1, 1) == "M" & is.na(data)==F & sapply(data, substring, 6, 6) == "3" ]
My main issue is that I need a speedy solution!
Upvotes: 1
Views: 94
Reputation: 43334
startsWith
and endsWith
are faster than substring
, and each sapply
is a loop, which will take a while. (If you need sapply
but need speed, check out vapply
.)
Here, we can use apply
to evaluate each row for any
matching elements where df
is the original data.frame (data
is a bad name that can cause conflicts). Rows with NA
s and matching values will return TRUE
; rows with NA
s but no matching values will return NA
. If you'd rather have FALSE
, wrap x
below in na.omit
.
# takes 7 seconds on my machine
row_indices <- apply(df, 1, function(x){any(startsWith(x, 'M') & endsWith(x, '3'))})
head(row_indices)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
# effectively instantaneous
df_subset <- df[row_indices, ]
df_subset[1, ]
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## 5 Q69164 D42439 X17664 A81746 Z82859 B10892 I39329 O29425 D83560 W14944 M64225 K47156 X26742
## X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26
## 5 I51962 Q57501 Q29214 W20713 U84761 S35597 D93796 F15041 V51597 O93538 O55946 F67256 D85638
## X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39
## 5 N82913 Q55887 V10815 M59412 L17626 E83108 E40069 I21677 U99952 X24291 O55932 M79693 C48984
## X40
## 5 O63422
row_indices
is a logical vector with TRUE
for each row that satisfies the conditions, and FALSE
otherwise. If you want it as a vector of 1s and 0s, coerce to integer:
row_indices_integer <- as.integer(row_indices)
head(row_indices_integer)
## [1] 0 0 0 0 1 0
Bonus: If you want an index of matches, a faster way is to convert to a matrix, which you can index as a giant vector. Both coercing and subsetting takes about 3 seconds on my machine.
df_m <- as.matrix(df) # make sure you have enough memory
matches <- df_m[startsWith(df_m, 'M') & endsWith(df_m, '3')]
head(matches)
## [1] "M67343" "M73753" "M61813" "M67903" "M25393" "M64273"
Upvotes: 3
Reputation: 3947
I can't vouch for "quickly", but this is hopefully reasonably fast.
Leave the data as a matrix, then apply a grepl
over it searching for the pattern once. Your subset has multiple conditions that are searched independently, which significantly increases processing time. Rather, a regex
search only passes over the matrix once. Here I'll use just the first three LETTERS
so the results are clear:
## simple example
set.seed(1337)
n=40L*30L
## leave the data as a matrix
data <- matrix(paste0(sample(LETTERS[1:3], n, replace=T), sample(c(10000:99999), n, replace=T)), ncol=40)
head(data)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] "B38741" "C40838" "C55778" "C68287" "A75243" "C78612" "C97148"
#> [2,] "B74071" "A96422" "A32523" "B24202" "C37424" "A43132" "C22392"
#> [3,] "A59076" "C46817" "C60189" "C45901" "B21047" "A37862" "A11632"
#> [4,] "B46388" "A58535" "C97558" "B40380" "B68486" "A47478" "B79243"
#> [5,] "B43385" "C94160" "A61546" "C57602" "C56007" "B79242" "B87491"
#> [6,] "A21263" "B16484" "A97093" "B85175" "A81363" "C11184" "A55769"
#> [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] "A60364" "B40600" "B47028" "A91703" "A29644" "C18189" "C27397"
#> [2,] "A79862" "C14384" "A76795" "C56399" "C17494" "A97131" "B93359"
#> [3,] "A71215" "B11017" "C28779" "A93073" "B41637" "B22438" "B71168"
#> [4,] "C93900" "B16089" "C11622" "B47537" "A59246" "A70339" "B50932"
#> [5,] "C49844" "A47108" "C25245" "C13744" "A33516" "B74005" "B66537"
#> [6,] "C18771" "A19343" "C80973" "B43260" "A34422" "C93295" "B78965"
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21]
#> [1,] "A11028" "B56979" "A82965" "B22924" "A41345" "A87897" "C50275"
#> [2,] "B94595" "C57700" "C52641" "A20425" "A71739" "B68890" "B41689"
#> [3,] "A81026" "C96270" "A25976" "B58948" "A90145" "A46911" "A91987"
#> [4,] "B67972" "A67499" "A25004" "A82919" "A65356" "B68767" "C14904"
#> [5,] "A63617" "A51983" "C23317" "C89821" "B34018" "A14499" "A52000"
#> [6,] "B72869" "B29073" "C44628" "A49682" "B40933" "B63931" "B33618"
#> [,22] [,23] [,24] [,25] [,26] [,27] [,28]
#> [1,] "C71020" "C25807" "C71922" "C54801" "B71390" "C46267" "A21002"
#> [2,] "A95403" "B10427" "C60669" "B28275" "C27129" "A73589" "C51430"
#> [3,] "C82188" "A49040" "A47890" "C50170" "C46741" "A80651" "C96004"
#> [4,] "A71968" "B54081" "C11767" "A79653" "C56663" "A34188" "B69740"
#> [5,] "A95088" "B22995" "C41531" "B29874" "B28097" "A51224" "B31888"
#> [6,] "B80639" "A45005" "C18228" "C96166" "A68540" "C12882" "C87778"
#> [,29] [,30] [,31] [,32] [,33] [,34] [,35]
#> [1,] "A12776" "C37308" "A42790" "C72401" "A35752" "B89962" "C83482"
#> [2,] "B78347" "C64552" "C18353" "B93191" "C75289" "C67916" "B67485"
#> [3,] "A70870" "A80181" "C82050" "B82481" "A49520" "C23423" "C70112"
#> [4,] "A32132" "A14885" "C98653" "C26822" "A60790" "C23231" "C71949"
#> [5,] "B60348" "C44937" "C31036" "A90012" "C81055" "C13806" "A13882"
#> [6,] "C91144" "C90857" "A14201" "B61155" "C43489" "B31109" "B83677"
#> [,36] [,37] [,38] [,39] [,40]
#> [1,] "A56118" "C84724" "C80238" "C22425" "B93431"
#> [2,] "C76893" "C16722" "A95319" "B54856" "B36155"
#> [3,] "C35688" "C39321" "A37359" "B26495" "A88542"
#> [4,] "B65839" "A65305" "C56256" "C11397" "B11289"
#> [5,] "A73778" "A69947" "A98016" "A40535" "B38351"
#> [6,] "B18384" "C61695" "B56768" "C39147" "C27065"
Extracting the elements (don't extract them if you want a matrix of TRUE/FALSE
in the original dimensions)
## find elements which start with "A" and end with "3"
data[grepl("^A.*3$", data)]
#> [1] "A21263" "A32523" "A97093" "A75243" "A81363" "A12483" "A25883"
#> [8] "A71073" "A19343" "A46733" "A91703" "A93073" "A80463" "A97873"
#> [15] "A87763" "A17953" "A91103" "A88293" "A47273" "A32483" "A68723"
#> [22] "A51983" "A91283" "A75353" "A71283" "A91793" "A75253" "A95403"
#> [29] "A73373" "A64863" "A74823" "A80853" "A41883" "A79653" "A97783"
#> [36] "A96793" "A57573" "A77983" "A93863" "A25783" "A20173" "A25313"
#> [43] "A53833" "A17933"
I've recently fallen in love with rex
for building regex
and while this is a simple search, it's more readable with R
syntax:
## alternatively, using the power of rex
data[grepl(rex::rex(start, "A", anything, "3", end), data)]
#> [1] "A21263" "A32523" "A97093" "A75243" "A81363" "A12483" "A25883"
#> [8] "A71073" "A19343" "A46733" "A91703" "A93073" "A80463" "A97873"
#> [15] "A87763" "A17953" "A91103" "A88293" "A47273" "A32483" "A68723"
#> [22] "A51983" "A91283" "A75353" "A71283" "A91793" "A75253" "A95403"
#> [29] "A73373" "A64863" "A74823" "A80853" "A41883" "A79653" "A97783"
#> [36] "A96793" "A57573" "A77983" "A93863" "A25783" "A20173" "A25313"
#> [43] "A53833" "A17933"
Upvotes: 3