Detect sequences of ordered strings and group them using R

Question

I have a character vector with about 500K elements, and I want to assign each element a value that gives its group number.

The grouping criteria are:

Group numbers are assigned consecutively from the top of the list.
Each element gets its own group, unless at least 3 consecutive elements are in ascending alphabetical order, in which case those consecutive elements share one group.

How do I do this in R?

Example input and expected output:

> my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
+                 "jdd", "12vd", "r34o", "z", "034mh")
> expected_output <- c(1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 7, 7, 8)
> (df <- data.frame(input = my_strings, output = expected_output))
     input output
1      xx1      1
2     1xxx      2
3  abc.xyz      3
4        a      4
5    ad022      4
6     ghj1      4
7      kf1      4
8     991r      5
9      jdd      6
10    12vd      7
11    r34o      7
12       z      7
13   034mh      8

So far, I have tried using dplyr::lead to compare each pair of consecutive elements and record their order. I don't know how to proceed from here, though.

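library(dplyr)
# pair each element with the one that follows it
res <- as_tibble(my_strings) %>%
  mutate(after = lead(my_strings))
# for each pair, record where the following element lands when the two are sorted
res$pre_group = apply(res, 1, function(x) order(c(x[1], x[2]))[2])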

r2evans · Accepted Answer

(Dang, this was a tough one :-)

tidyverse

library(dplyr)
df %>%
  mutate(r1 = cumsum(c(TRUE, diff(rank(input)) < 0)) + 0) %>%
  group_by(r1) %>%
  mutate(r2 = r1 + seq(0, 0.9*(n() < 3), len = n()) / n()) %>%
  ungroup() %>%
  mutate(r1 = with(list(rl = rle(r2)$lengths), rep(seq_along(rl), times = rl))) %>%
  select(-r2)
# # A tibble: 13 x 3
#    input   output    r1
#         
#  1 xx1          1     1
#  2 1xxx         2     2
#  3 abc.xyz      3     3
#  4 a            4     4
#  5 ad022        4     4
#  6 ghj1         4     4
#  7 kf1          4     4
#  8 991r         5     5
#  9 jdd          6     6
# 10 12vd         7     7
# 11 r34o         7     7
# 12 z            7     7
# 13 034mh        8     8

(The lengthy with(...) in the mutate is just an inline version of data.table::rleid.)
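
To make that inline expression easier to read, here is a minimal base R sketch of the same run-numbering idea. The helper name run_id is made up for illustration; it is not part of dplyr or data.table, and it only handles a single vector.

```r
# Hypothetical helper: number each run of consecutive equal values,
# which is what the with(...) expression above does for r2.
run_id <- function(x) {
  rl <- rle(x)$lengths            # length of each run of equal values
  rep(seq_along(rl), times = rl)  # repeat the run index once per element
}

# Three singleton runs followed by one run of three equal values
run_id(c(1, 2, 2.9, 3, 3, 3))
```

Each change in value starts a new id, so the fractional offsets added to short runs are what force their elements into separate groups.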

`data.table`

library(data.table)
as.data.table(df)[
, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ][
, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ][
, r1 := rleid(r1) ]

If you want to blur the lines between R dialects a little, the same data.table steps can be chained with the magrittr pipe:

library(data.table)
library(magrittr)
as.data.table(df) %>%
  .[, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ] %>%
  .[, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ] %>%
  .[, r1 := rleid(r1) ]

Notes:

... + 0 is shorthand for as.numeric(...). It is needed because data.table keeps a column's original class when that column is updated. Without the + 0, the first definition of r1 would be integer, while the next assignment to r1 produces numeric values; since data.table persists the original class, those values would be coerced (truncated) back to integer and the fractional offsets would be lost.
seq(0, 0.9*(...)) reduces to seq(0, 0) when a group has three or more elements, which makes the addition a no-op for that group. (Group size comes from dplyr's n() and data.table's .N.)
The implementations differ slightly because dplyr prohibits modifying the grouping variable(s), whereas data.table has no issue with this. (I'm not certain which behavior is correct or better ...)
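
As a rough illustration of the notes above, the same three steps can be sketched in base R, with ave() standing in for the grouped mutate. This is an assumption-laden sketch rather than part of the original answer; the variable names r2 and groups are only illustrative.

```r
my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
                "jdd", "12vd", "r34o", "z", "034mh")

# The offset trick on its own: a no-op for runs of 3+, distinct values otherwise
seq(0, 0.9 * (3 < 3), length.out = 3)  # all zeros
seq(0, 0.9 * (2 < 3), length.out = 2)  # 0 and 0.9

# Step 1: start a new run wherever the rank drops (+ 0 keeps r1 numeric)
r1 <- cumsum(c(TRUE, diff(rank(my_strings)) < 0)) + 0

# Step 2: within runs shorter than 3, add a distinct fractional offset per element
r2 <- r1 + ave(r1, r1, FUN = function(g) {
  n <- length(g)
  seq(0, 0.9 * (n < 3), length.out = n)
})

# Step 3: renumber consecutive equal values as 1, 2, 3, ...
rl <- rle(r2)$lengths
groups <- rep(seq_along(rl), times = rl)
groups
```

Note that rank() compares strings using the current locale's collation, so inputs that mix upper and lower case or punctuation may group differently across systems.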
