Prunus Persica
Prunus Persica

Reputation: 1203

Efficient sub-string search over large dataset

I have a large dataset tPro1 (~500k points). Seen below, the variable of interest is tPro1$Path.

      Path                                  Row      rm                                              
1  >root>aaaa>bbbb>cccc>dddd>hello         1        TRUE
2  >root>aaaa>bbbb>cccc>dddd>greetings     2        TRUE
3  >root>aaaa>bbbb>cccc>dddd>example       3        TRUE
4  >root>iiii>jjjj>kkkk>llll>mmmm          4        TRUE
5  >root>iiii>jjjj>kkkk>nnnn>testing       5        TRUE

I also have a smaller dataset, let's call it Sub1, of a couple dozen or so dataponts. It has higher level paths than tPro1.

     [1] ">root>aaaa>bbbb>cccc>dddd"
     [2] ">root>aaaa>bbbb>eeee>ffff"
     [3] ">root>aaaa>bbbb>gggg>hhhh" 
     [4] ">root>iiii>jjjj>kkkk>llll>mmmm"
     [5] ">root>iiii>jjjj>kkkk>nnnn" 
     [6] ">root>oooo>pppp>qqqq"

What I am trying to do is associate the longer paths in tPro1 with the shorter ones in Sub1. tPro1 is a copy of some key info from Pro0. Output Pro0 would be

          Path                                  Short_path                                                    
1  >root>aaaa>bbbb>cccc>dddd>hello         >root>aaaa>bbbb>cccc>dddd
2  >root>aaaa>bbbb>cccc>dddd>greetings     >root>aaaa>bbbb>cccc>dddd
3  >root>aaaa>bbbb>cccc>dddd>example       >root>aaaa>bbbb>cccc>dddd
4  >root>iiii>jjjj>kkkk>llll>mmmm          >root>iiii>jjjj>kkkk>llll>mmmm
5  >root>iiii>jjjj>kkkk>nnnn>testing       >root>iiii>jjjj>kkkk>nnnn

I've written a loop that for each path in Sub1, grepl's each tPro1 to see if it is a substring. For 500k*24 points, this would be a very inefficient process, so I've tried some optimizations:

  1. Note tPro1$rm. When a substring is found, this is set to false. They are removed/skipped afterwards to save pointless recheck time.
    1. A Path s may appear multiple times in tPro1. Thus when a valid substring p is found for s, instead of continuing to grepl, the algorithm goes through the dataset and looks for all unchecked instances of s.

My code is

start.time <- Sys.time()

for (p in Sub1$Path) {
  for (i in 1:NROW(tPro1)) {
    if (tPro1[i,3]) {
      if (grepl(p, tPro1[i,1], fixed=TRUE)) {
        # Replace all of subpath 
        for (j in i:NROW(tPro1)) {
          if (tPro1[j,1] == tPro1[i,1]) {
            Pro0[tPro1[j,2],2] <- p
            tPro1[j,3] <- FALSE
          }
        }
      }
    }
  }
  v <- unlist(tPro1[,3])
  tPro1 <- tPro1[v,]
}

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Processing the full dataset does not halt in human time (on my machine at least). For illustrative purposes, doing batches of 1000 at a time (reduced tPro1) takes 46secs. 2000 takes 1 mins, 3000:1.4mins.

Any glaring improvements that could be made, or is it just the nature of the problem?

EDIT: There are around 54k unique long paths, and also not all of the long paths have a corresponding short path (eg in tPro1 there is >root>strange>path, whereas in sub1 there is no path of the form >root>strange)

EDIT2: Following rosscova's answer below, the time was taken down from possibly eternity, to 279.75 seconds!

Upvotes: 2

Views: 471

Answers (4)

amonk
amonk

Reputation: 1795

Given the two datasets (in the form of data.table):

library(data.table) # for data manipulation
library(stringi) # for string manipulation

 >dt1 
                               Path Row   rm
 1:     >root>aaaa>bbbb>cccc>dddd>hello   1 TRUE
 2: >root>aaaa>bbbb>cccc>dddd>greetings   2 TRUE
 3:   >root>aaaa>bbbb>cccc>dddd>example   3 TRUE
 4:      >root>iiii>jjjj>kkkk>llll>mmmm   4 TRUE
 5:   >root>iiii>jjjj>kkkk>nnnn>testing   5 TRUE

 > dt2 # introduced column name `names`

                        names
 1:      >root>aaaa>bbbb>cccc>dddd
 2:      >root>aaaa>bbbb>eeee>ffff
 3:      >root>aaaa>bbbb>gggg>hhhh
 4: >root>iiii>jjjj>kkkk>llll>mmmm
 5:      >root>iiii>jjjj>kkkk>nnnn
 6:           >root>oooo>pppp>qqqq

dt1b<-cbind(t(dt1[,stri_split(Path,fixed=">")]),dt1[,.(Row,rm)])[,V1:=NULL]
dt2b<-data.table(t(dt2[,stri_split(str = names,fixed=">")]))[,V1:=NULL]

 >dt1b
      V2   V3   V4   V5   V6        V7 Row   rm
1: root aaaa bbbb cccc dddd     hello   1 TRUE
2: root aaaa bbbb cccc dddd greetings   2 TRUE
3: root aaaa bbbb cccc dddd   example   3 TRUE
4: root iiii jjjj kkkk llll      mmmm   4 TRUE
5: root iiii jjjj kkkk nnnn   testing   5 TRUE

and

 >dt2b
      V2   V3   V4   V5   V6   V7
1: root aaaa bbbb cccc dddd      
2: root aaaa bbbb eeee ffff     
3: root aaaa bbbb gggg hhhh     
4: root iiii jjjj kkkk llll mmmm
5: root iiii jjjj kkkk nnnn     
6: root oooo pppp qqqq      root

Finally I compare each row of the dt1b with every row of dt2b via:

  sub1<-subset(dt1b, select = grep("^V+", names(dt1b),perl = TRUE,value = TRUE))

Create (list that contains) all the possible comparisons

  l1<-lapply(seq(1:nrow(sub1)),function(x) {l1<-lapply(seq(1:nrow(dt2b)),function(y) {l2<-data.table(t(sub1[x] %in% dt2b[y]));names(l2)<-paste0(dt2b[y]);return(l2)}); names(l1)<-paste(sub1[x],collapse=" ");return(l1)})

portion of result

     l1[1:2]
    [[1]]
    [[1]]$`root aaaa bbbb cccc dddd hello`
       root aaaa bbbb cccc dddd      
    1: TRUE TRUE TRUE TRUE TRUE FALSE

    [[1]]$<NA>
       root aaaa bbbb  eeee  ffff      
    1: TRUE TRUE TRUE FALSE FALSE FALSE

    [[1]]$<NA>
       root aaaa bbbb  gggg  hhhh      
    1: TRUE TRUE TRUE FALSE FALSE FALSE

    [[1]]$<NA>
       root  iiii  jjjj  kkkk  llll  mmmm
    1: TRUE FALSE FALSE FALSE FALSE FALSE

    [[1]]$<NA>
       root  iiii  jjjj  kkkk  nnnn      
    1: TRUE FALSE FALSE FALSE FALSE FALSE

    [[1]]$<NA>
       root  oooo  pppp  qqqq        root
    1: TRUE FALSE FALSE FALSE FALSE FALSE



    [[2]]
    [[2]]$`root aaaa bbbb cccc dddd greetings`
       root aaaa bbbb cccc dddd      
    1: TRUE TRUE TRUE TRUE TRUE FALSE

    [[2]]$<NA>
       root aaaa bbbb  eeee  ffff      
    1: TRUE TRUE TRUE FALSE FALSE FALSE

    [[2]]$<NA>
       root aaaa bbbb  gggg  hhhh      
    1: TRUE TRUE TRUE FALSE FALSE FALSE

    [[2]]$<NA>
       root  iiii  jjjj  kkkk  llll  mmmm
    1: TRUE FALSE FALSE FALSE FALSE FALSE

    [[2]]$<NA>
       root  iiii  jjjj  kkkk  nnnn      
    1: TRUE FALSE FALSE FALSE FALSE FALSE

    [[2]]$<NA>
       root  oooo  pppp  qqqq        root
    1: TRUE FALSE FALSE FALSE FALSE FALSE

So now you can have a score per row of dt1b e.g. 0/6 (not even close),..., 5/6 (almost identical), 6/6 (exactly identical).

IDEA (edit)

Here is my idea:

l2<-lapply(seq_along(1:length(l1)),function(x) {
  z=rbindlist(t(l1[[x]][1:nrow(dt2b)]),fill = TRUE)
  z=cbind(z,score=apply(z,1,sum,na.rm=TRUE))
  setorder(z,-score)
  z[,V1:=NULL]
  z<-cbind(t(rep(names(l1[[x]][1]))),z)
  names(z)[1]<-"initialString"
  return(z)
})


   > l2[1:2]
 [[1]]
                     initialString root aaaa bbbb cccc dddd  eeee  ffff  gggg  hhhh  iiii  jjjj  kkkk  llll  mmmm  nnnn score
 1: root aaaa bbbb cccc dddd hello TRUE TRUE TRUE TRUE TRUE    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA     5
 2: root aaaa bbbb cccc dddd hello TRUE TRUE TRUE   NA   NA FALSE FALSE    NA    NA    NA    NA    NA    NA    NA    NA     3
 3: root aaaa bbbb cccc dddd hello TRUE TRUE TRUE   NA   NA    NA    NA FALSE FALSE    NA    NA    NA    NA    NA    NA     3
 4: root aaaa bbbb cccc dddd hello TRUE   NA   NA   NA   NA    NA    NA    NA    NA FALSE FALSE FALSE FALSE FALSE    NA     1
 5: root aaaa bbbb cccc dddd hello TRUE   NA   NA   NA   NA    NA    NA    NA    NA FALSE FALSE FALSE    NA    NA FALSE     1

 [[2]]
                         initialString root aaaa bbbb cccc dddd  eeee  ffff  gggg  hhhh  iiii  jjjj  kkkk  llll  mmmm  nnnn score
 1: root aaaa bbbb cccc dddd greetings TRUE TRUE TRUE TRUE TRUE    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA     5
 2: root aaaa bbbb cccc dddd greetings TRUE TRUE TRUE   NA   NA FALSE FALSE    NA    NA    NA    NA    NA    NA    NA    NA     3
 3: root aaaa bbbb cccc dddd greetings TRUE TRUE TRUE   NA   NA    NA    NA FALSE FALSE    NA    NA    NA    NA    NA    NA     3
 4: root aaaa bbbb cccc dddd greetings TRUE   NA   NA   NA   NA    NA    NA    NA    NA FALSE FALSE FALSE FALSE FALSE    NA     1
 5: root aaaa bbbb cccc dddd greetings TRUE   NA   NA   NA   NA    NA    NA    NA    NA FALSE FALSE FALSE    NA    NA FALSE     1

... or by keeping the row with maximum score column, (this can be attained via: return(z) changes into return(z[score==max(score)]) in the l2 lapply() above) and rbindlist(t(l2[1:length(l2)])):

                        initialString root aaaa bbbb cccc dddd eeee ffff gggg hhhh iiii jjjj kkkk llll mmmm nnnn score
1:     root aaaa bbbb cccc dddd hello TRUE TRUE TRUE TRUE TRUE   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA     5
2: root aaaa bbbb cccc dddd greetings TRUE TRUE TRUE TRUE TRUE   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA     5
3:   root aaaa bbbb cccc dddd example TRUE TRUE TRUE TRUE TRUE   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA     5
4:      root iiii jjjj kkkk llll mmmm TRUE   NA   NA   NA   NA   NA   NA   NA   NA TRUE TRUE TRUE TRUE TRUE   NA     6
5:   root iiii jjjj kkkk nnnn testing TRUE   NA   NA   NA   NA   NA   NA   NA   NA TRUE TRUE TRUE   NA   NA TRUE     5

The column initialString now holds the initial string. The following columns hold its decomposition into substrings and its similarity score

Upvotes: 1

rosscova
rosscova

Reputation: 5580

The fact that sub is so small could help a lot in reducing the number of iterations necessary. Here's a more efficient way than what you've got, although I'm still using a loop here.

First, set up some test data. Use the same sizes as you've specified:

set.seed(123)

sub <- sapply( seq_len( 24 ), function(x) {
    paste( sample( c( letters, ">" ),
                   12,
                   replace = TRUE,
                   prob = c( rep( 1, 26 ), 8 ) ),
           collapse = "")
} )
head( sub, 3 )
# [1] "puhyz>lymjbj" "rn>yc>fbyrda" "qsmop>byrv>k"

Use sub to create tPro1 such that there are substrings to find as appropriate.

tPro1 <- paste0( sample( sub,
                         5E5,
                         replace = TRUE ),
                 sample( c( ">hello", ">adf", ">;kjadf" ),
                         5E5,
                         replace = TRUE )
)
head( tPro1, 3 )
# [1] "bjwhrj>j>>zj>adf"   "b>>>zpx>fpvg>hello" ">q>hn>ljsllh>adf"  

Now use a while loop. Iterate over sub, getting as many matches as possible in each iteration. Stop iterating if we get to the end of sub, or if all values have been filled.

results <- vector( "character", length( tPro1 ) )
i <- 1L
system.time(
    while( sum( results == "" ) > 0L && i <= length( sub ) ) {
        results[ grep( sub[i], tPro1 ) ] <- sub[i]
        i <- i + 1L
    }
)
#    user  system elapsed 
#  4.655   0.007   4.661

Output the results.

output <- data.frame( tPro1 = tPro1, results = results, stringsAsFactors = FALSE )
head( output, 3 )

#                             tPro1                  results
# 1 >>ll>ldsjbzzcszcniwm>>em>;kjadf >>ll>ldsjbzzcszcniwm>>em
# 2 ijka>ca>>>ddpmhilphqlt>c>;kjadf ijka>ca>>>ddpmhilphqlt>c
# 3 zpnsniwyletn>qzifzjtrjg>>;kjadf zpnsniwyletn>qzifzjtrjg>

So this is not a completely vectorised solution, but it does save you some time. We're down to 4.6s for the same sized dataset that you're working with.

EDIT: Silly me, I was working with sub a few thousand values long. After reducing the size of sub to a couple of dozen like you say, it makes this a lot quicker!

EDIT: with the data as you've shown it, you may need to create the tPro1 and sub vectors first:

tPro1.vec <- tPro1$Path
sub <- Sub1$Path

results <- vector( "character", length( tPro1.vec ) )
i <- 1L
while( sum( results == "" ) > 0L && i <= length( sub ) ) {
    results[ grep( sub[i], tPro1.vec ) ] <- sub[i]
    i <- i + 1L
}

Upvotes: 1

zx8754
zx8754

Reputation: 56149

Using fuzzy-matching, agrepl:

tPro1$Short_path <- Sub1$Path[ apply(sapply(Sub1$Path, function(i) agrepl(i, tPro1$Path)), 1, which) ] 

tPro1

#                                  Path Row   rm                     Short_path
# 1     >root>aaaa>bbbb>cccc>dddd>hello   1 TRUE      >root>aaaa>bbbb>cccc>dddd
# 2 >root>aaaa>bbbb>cccc>dddd>greetings   2 TRUE      >root>aaaa>bbbb>cccc>dddd
# 3   >root>aaaa>bbbb>cccc>dddd>example   3 TRUE      >root>aaaa>bbbb>cccc>dddd
# 4      >root>iiii>jjjj>kkkk>llll>mmmm   4 TRUE >root>iiii>jjjj>kkkk>llll>mmmm
# 5   >root>iiii>jjjj>kkkk>nnnn>testing   5 TRUE      >root>iiii>jjjj>kkkk>nnnn

data

tPro1  <- read.table(text = "Path                                  Row      rm                                              
1  >root>aaaa>bbbb>cccc>dddd>hello         1        TRUE
2  >root>aaaa>bbbb>cccc>dddd>greetings     2        TRUE
3  >root>aaaa>bbbb>cccc>dddd>example       3        TRUE
4  >root>iiii>jjjj>kkkk>llll>mmmm          4        TRUE
5  >root>iiii>jjjj>kkkk>nnnn>testing       5        TRUE",
                     header = TRUE, stringsAsFactors = FALSE)


Sub1 <- data.frame(Path = c(">root>aaaa>bbbb>cccc>dddd",
                            ">root>aaaa>bbbb>eeee>ffff",
                            ">root>aaaa>bbbb>gggg>hhhh",
                            ">root>iiii>jjjj>kkkk>llll>mmmm",
                            ">root>iiii>jjjj>kkkk>nnnn",
                            ">root>oooo>pppp>qqqq"),
                   stringsAsFactors = FALSE)

Upvotes: 2

Till
Till

Reputation: 707

The following code should solve your problem in no time.

library(data.table)
library(stringi)

Pro0 <- data.table(tPro1)

for (i in 1:length(Sub1$Short_path)) {
  Pro0[stri_detect_fixed(Path, Sub1$Short_path[i]), Short_path:=Sub1$Short_path[i]]
}

Using this approach I just associated 230k path names with 14 shorter path names within a second.

And this is the code I used to create the datasets tPro1 and Sub1 corresponding to your ones:

tPro1 <- data.table('Path' = list.files(path = '/usr', full.names = TRUE, recursive = TRUE))
Sub1 <- data.table('Short_path' = list.files(path = '/usr', full.names = TRUE))

Upvotes: 1

Related Questions