Jon

Reputation: 455

String concatenation using an apply function in R

I have the following code whose purpose is to transcribe a sequence into tuples of three. It executes correctly, but is particularly slow when applied to very large data sets (i.e. millions of rows).

I suspect the culprit is the nested for loops across a vector (particularly the inner for (y ...) loop), and feel there should be a more efficient method using one of the apply functions. Unfortunately I'm not overly familiar with this approach and would like to request some assistance (please!).

M.Order <- function(in.vector) {
  return.str <- vector()
  in.vector <- strsplit(in.vector, ' > ', fixed = T)
  for (x in 1:length(in.vector)) {
      output <- NULL
      if(length(in.vector[[x]]) == 1) {
          output <- paste0(in.vector[[x]], '|NULL|NULL')
      } else if(length(in.vector[[x]]) == 2) {
          output <- paste(c(in.vector[[x]][1], in.vector[[x]][2],'NULL'), collapse='|')
      } else if(length(in.vector[[x]]) == 3) {
          output <- paste(in.vector[[x]], collapse = '|')
      } else for (y in 1:(length(in.vector[[x]])-2)) {
          output <- ifelse(length(output) == 0
                          ,paste(in.vector[[x]][y:(y+2)], collapse = '|')
                          ,paste0(output, ' > ', paste(in.vector[[x]][y:(y+2)], collapse = '|'))
                          )
      }
      return.str[x] <- output
  }
return (return.str)
}

orig.str <- rbind.data.frame(
  'A > B > C > B > B > A > B > A > C',
  'A > B',
  'A > C > B',
  'A',
  'A > B > D > C')

colnames(orig.str) <- 'Original'
orig.str$Processed <- M.Order(as.character(orig.str$Original))
orig.str

which returns (correctly)

                           Original                                             Processed
1 A > B > C > B > B > A > B > A > C A|B|C > B|C|B > C|B|B > B|B|A > B|A|B > A|B|A > B|A|C
2                             A > B                                              A|B|NULL
3                         A > C > B                                                 A|C|B
4                                 A                                           A|NULL|NULL
5                     A > B > D > C                                         A|B|D > B|D|C

Upvotes: 4

Views: 2705

Answers (3)

Florian

Reputation: 25425

EDIT: removed the rollapply function, since it is slow, and wrote my own function instead. Runtime on 327,680 rows:

  • My code: 5.62 seconds
  • Your code: 5.66 seconds.

So no significant difference.

First, split the strings on the ' > ' separator, and pad the vector with "NULL" strings if it does not have at least three elements. Then use the groups function below (which replaces the earlier rollapply call) to concatenate each run of three elements, separated by "|", and in the end collapse those groups with " > ".

# sample data
df  = data.frame(Original=c("A > B > C > B > B > A > B > A > C","A > B","A > C > B","A","A > B > D > C"),stringsAsFactors = FALSE)
for(i in 1:16) df=rbind(df,df)

groups <- function(x)
{
  result <- vector("character", length(x)-2)
  for(k in 1:(length(x)-2) )
  {
    result[k] = paste(x[k:(k+2)],collapse="|")
  }
  return(paste(result,collapse=" > "))
}

array1 = lapply(strsplit(df$Original," > "), function(x) if (length(x) == 1) {c(x[1],"NULL","NULL")} else {if (length(x) == 2) {c(x[1:2],"NULL")} else {x}})
df$modified =  lapply(array1,groups)

Output: (as list for legibility)

[[1]]
[1] "A|B|C > B|C|B > C|B|B > B|B|A > B|A|B > A|B|A > B|A|C"

[[2]]
[1] "A|B|NULL"

[[3]]
[1] "A|C|B"

[[4]]
[1] "A|NULL|NULL"

[[5]]
[1] "A|B|D > B|D|C"
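
If a plain character column is preferred over a list column, one possible variation is to use vapply instead of lapply. This is only a sketch under that assumption: the helper names pad3 and groups3 are hypothetical and not part of the answer above, and it has not been timed against the original.

```r
# Hypothetical helpers; the logic mirrors the pad-then-group approach above.
pad3 <- function(x) {
  # Pad with the literal string "NULL" until there are three elements.
  if (length(x) < 3) c(x, rep("NULL", 3 - length(x))) else x
}

groups3 <- function(x) {
  # One tuple per start position; x is assumed to have length >= 3 here.
  starts <- seq_len(length(x) - 2)
  tuples <- vapply(starts,
                   function(k) paste(x[k:(k + 2)], collapse = "|"),
                   character(1))
  paste(tuples, collapse = " > ")
}

original <- c("A > B > C > B", "A > B", "A")
parts <- strsplit(original, " > ", fixed = TRUE)

# vapply returns a character vector, so it can be assigned to a column directly.
modified <- vapply(parts, function(x) groups3(pad3(x)), character(1))
print(modified)
```

The result can then be assigned with df$modified = modified, which keeps the column atomic rather than a list.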

Hope this helps!

Upvotes: 1

Konrad Rudolph

Reputation: 546063

The fundamental logic seems to be described by the following rule:

  1. Split strings by >
  2. For each string, starting at every position, merge the next 3 characters using '|'.
  3. Merge all resulting tuples with spaces.

Step 2 is the most complex. It can be solved using the following generalised function:

merge_tuples = function (str, len, sep) {
    start_positions = seq_len(max(length(str) - len + 1, 1))
    tuple_indices = lapply(start_positions, seq, length.out = len)
    lapply(tuple_indices, function (i) paste(str[i], collapse = sep))
}

This has been generalised to work with any tuple size (not just 3) and any separator (not just '|').

Example:

> merge_tuples(c('A', 'B', 'C'), 2, ':')
[[1]]
[1] "A:B"

[[2]]
[1] "B:C"

With this in place, the rest is easily solved:

orig = c('A > B > C > B > B > A > B > A > C',
         'A > B',
         'A > C > B',
         'A',
         'A > B > D > C')

tuples = lapply(strsplit(orig, ' > '), merge_tuples, len = 3, sep = '|')
merged = sapply(tuples, paste, collapse = ' ')

This will output NA instead of NULL (as in your code) in places where there are not enough elements. I’m assuming this isn’t a big deal. If it is, replace the occurrences with gsub.
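
As a rough sketch of that gsub step: the pattern below is an assumption on my part, and it would also rewrite a genuine element that happens to be the string "NA", so it only fits data where that cannot occur.

```r
# merge_tuples is repeated from above so that this snippet runs on its own.
merge_tuples <- function(str, len, sep) {
  start_positions <- seq_len(max(length(str) - len + 1, 1))
  tuple_indices <- lapply(start_positions, seq, length.out = len)
  lapply(tuple_indices, function(i) paste(str[i], collapse = sep))
}

orig <- c("A > B", "A", "A > C > B")
tuples <- lapply(strsplit(orig, " > ", fixed = TRUE),
                 merge_tuples, len = 3, sep = "|")
merged <- sapply(tuples, paste, collapse = " ")

# Replace whole-word NA (produced by indexing past the end) with NULL.
fixed <- gsub("\\bNA\\b", "NULL", merged)
print(fixed)
```

The word boundaries keep the replacement from touching longer elements that merely contain the letters NA.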

Upvotes: 1

Spacedman

Reputation: 94277

Partial solution...

The following function converts one string:

makes = function (S) 
{
    L = strsplit(gsub(" > ", "", S), "")[[1]]
    m = outer(1:3, 0:(length(L) - 3), "+")
    m[] = L[m]
    paste(apply(m, 2, function(x) {
        paste0(x, collapse = "|")
    }), collapse = " > ")
}

It works by using outer to make a matrix of offsets and then using that to get the elements out of the string once the string has been cleaned into just the letters and split into a vector. Then it's just a case of pasting it all back together:

> makes(orig.str$Original[1])
[1] "A|B|C > B|C|B > C|B|B > B|B|A > B|A|B > A|B|A > B|A|C"

It makes a hash of the ones that are shorter than 3 though:

> makes(orig.str$Original[2])
[1] "A|B|NA > A|B|A"
Warning message:
In m[] = L[m] :
  number of items to replace is not a multiple of replacement length
> makes(orig.str$Original[3])
[1] "A|C|B"
> makes(orig.str$Original[4])
Error in L[m] : only 0's may be mixed with negative subscripts
> makes(orig.str$Original[5])
[1] "A|B|D > B|D|C"

It might be worth detecting those edge cases explicitly (length(L) < 3 in the code should do it) and handling them separately.

Then apply over your data frame to do each one.
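
A minimal sketch of both steps, assuming short inputs should be padded with the string "NULL" as in the question. The name makes_padded is hypothetical, and unlike makes it splits on " > " rather than into single characters, so that the padding survives as whole elements.

```r
# Hypothetical variant of makes() that handles fewer than three elements.
makes_padded <- function(S) {
  L <- strsplit(S, " > ", fixed = TRUE)[[1]]
  # Edge case: pad short inputs so the offset matrix has at least one column.
  if (length(L) < 3) L <- c(L, rep("NULL", 3 - length(L)))
  # Each column of m holds the three indices of one tuple.
  m <- outer(1:3, 0:(length(L) - 3), "+")
  # Replace indices with elements; m[] keeps the matrix dimensions.
  m[] <- L[m]
  paste(apply(m, 2, paste0, collapse = "|"), collapse = " > ")
}

orig <- c("A > B > D > C", "A > B", "A")

# Apply to every string; vapply guarantees one string per input.
processed <- vapply(orig, makes_padded, character(1), USE.NAMES = FALSE)
print(processed)
```

The same call can be pointed at a data frame column, for example vapply(as.character(orig.str$Original), makes_padded, character(1), USE.NAMES = FALSE).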

Upvotes: 0
