String decomposition

Question

I need to decompose about 75 million character strings using R. I need to do something like creating a Term Document matrix, where each word that occurs in the document becomes a column in the matrix and anywhere the term occurs, the matrix element is coded as 1.

I have: About 75 million character strings ranging in length from about 0-100 characters; they represent a time series giving coded information about what happened in that period. Each code is exactly one character and corresponds to a time period.

I need: Some kind of matrix or way of conveying the information that takes away the time series and just tells me how many times a certain code was reported in each series.

For instance: The string "ABCDEFG-123" would become be a row in the matrix where each character would be tallied as occurring once. If this is too difficult a matrix of 0s and 1s would also give me some information though I would prefer to keep as much information as possible.

Does anyone have any ideas of how to do this quickly? There are 20 possible codes.

Frank · Accepted Answer

Example:

my20chars = c(LETTERS[1:10], 0:9)

set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))

One approach:

library(data.table)

d = setDT(stack(strsplit(setNames(x,x),"")))
dcast(d[ values %in% my20chars ], ind ~ values, fun = length)

Result:

              ind 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
    1: ---8EEAD8I 0 0 0 0 0 0 0 0 2 0 1 0 0 1 2 0 0 0 1 0
    2: --33B6E-32 0 0 1 3 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
    3: --3IFBG8GI 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 2 0 2 0
    4: --4210I8H5 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
    5: --5H4DE9F- 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0
   ---                                                   
 9996: JJFJBJ24AJ 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 5
 9997: JJI-J-0FGB 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 3
 9998: JJJ1B54H63 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 3
 9999: JJJED7A3FI 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 0 0 1 3
10000: JJJIF6GI13 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 2 3

Benchmark:

library(microbenchmark)

nstrs  = 1e5
nchars = 10
x = replicate(nstrs, paste0(sample(c(my20chars,"-"), nchars, replace=TRUE), collapse=""))

microbenchmark(
dcast = {
  d = setDT(stack(strsplit(setNames(x,x),"")))
  dcast(d[ values %in% my20chars ], ind ~ values, fun = length, value.var="ind")
},
times = 10)

# Unit: seconds
#   expr      min       lq     mean   median       uq      max neval
#  dcast 3.112633 3.423935 3.480692 3.494176 3.573967 3.741931    10

So, this is not fast enough to handle the OP's 75 million strings, but may be a good place to start.

String decomposition

Answers (2)

Related Questions