jmh12

Reputation: 33

Detect sequences of ordered strings and group them using R

I have a string vector with about 500K elements, and I want to assign each element a value indicating its group number.

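The grouping criteria are as follows:

  • Three or more consecutive elements in ascending order belong to the same group.

  • Every other element gets a group of its own.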

How do I do this in R?

Example input and expected output:

> my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
+                 "jdd", "12vd", "r34o", "z", "034mh")
> expected_output <- c(1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 7, 7, 8)
> (df <- data.frame(input = my_strings, output = expected_output))
     input output
1      xx1      1
2     1xxx      2
3  abc.xyz      3
4        a      4
5    ad022      4
6     ghj1      4
7      kf1      4
8     991r      5
9      jdd      6
10    12vd      7
11    r34o      7
12       z      7
13   034mh      8

So far, I have tried using dplyr::lead to compare each pair of consecutive elements, but I don't know how to proceed from here.

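library(dplyr)

# Pair each element with its successor; pre_group is 2 when the pair is ascending, 1 otherwise
res <- as_tibble(my_strings) %>%
  mutate(after = lead(my_strings))
res$pre_group <- apply(res, 1, function(x) order(c(x[1], x[2]))[2])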

Upvotes: 3

Views: 115

Answers (2)

r2evans

Reputation: 160447

(Dang, this was a tough one :-)

tidyverse

library(dplyr)
df %>%
  mutate(r1 = cumsum(c(TRUE, diff(rank(input)) < 0)) + 0) %>%
  group_by(r1) %>%
  mutate(r2 = r1 + seq(0, 0.9*(n() < 3), len = n()) / n()) %>%
  ungroup() %>%
  mutate(r1 = with(list(rl = rle(r2)$lengths), rep(seq_along(rl), times = rl))) %>%
  select(-r2)
# # A tibble: 13 x 3
#    input   output    r1
#    <chr>    <dbl> <int>
#  1 xx1          1     1
#  2 1xxx         2     2
#  3 abc.xyz      3     3
#  4 a            4     4
#  5 ad022        4     4
#  6 ghj1         4     4
#  7 kf1          4     4
#  8 991r         5     5
#  9 jdd          6     6
# 10 12vd         7     7
# 11 r34o         7     7
# 12 z            7     7
# 13 034mh        8     8

(The lengthy with(...) in the mutate is just an inline version of data.table::rleid.)
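To make that parenthetical concrete, here is a minimal base-R sketch of the same idea. The helper name run_id is made up for illustration and is not part of any package; it only mimics what data.table::rleid does for a single vector.

```r
# Hypothetical helper (not from any package): number consecutive runs of
# equal values, similar in spirit to data.table::rleid() on one vector.
run_id <- function(v) {
  rl <- rle(v)$lengths              # length of each run of equal values
  rep(seq_along(rl), times = rl)    # repeat the run index once per element
}

run_id(c(1, 1, 2, 2, 2, 1, 3, 3))
# [1] 1 1 2 2 2 3 4 4
```

Note that a value that reappears later (the second 1 above) starts a new run instead of reusing the earlier id.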

data.table

library(data.table)
as.data.table(df)[
, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ][
, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ][
, r1 := rleid(r1) ]

If you want to blur the lines between R dialects a little, the same steps can be written with magrittr pipes:

library(data.table)
library(magrittr)
as.data.table(df) %>%
  .[, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ] %>%
  .[, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ] %>%
  .[, r1 := rleid(r1) ]

Notes:

  • ... + 0 is shorthand for as.numeric(...). It is needed because data.table preserves a column's original class when updating it with :=. Without + 0, the first definition of r1 would be integer, and the fractional values assigned in the next step would be coerced (truncated) back to integer, which would defeat the purpose.

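  • seq(0, 0.9*(...)) reduces to seq(0, 0) when a group has three or more elements, so that group is left unchanged. For groups of one or two elements it produces distinct fractional offsets, which the final run-length step then splits into separate groups. (This uses dplyr's n() and data.table's .N for the group size.)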

  • the implementations differ slightly because dplyr prohibits modifying the grouping variable(s); data.table has no issue with this. (I'm not certain which direction is correct or better ...)
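For anyone who wants to see the intermediate values, the three steps can be traced in base R alone. This is only a sketch of the same logic, not the code from either version above: it swaps the grouped mutate for ave(), and the variable names r2, offset, and groups are mine. It also assumes the session's collation sorts digits before letters, since string ranks depend on the locale.

```r
my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
                "jdd", "12vd", "r34o", "z", "034mh")

# Step 1: start a new run wherever the rank drops. cumsum() of a logical is
# integer, so + 0 converts it to double to allow fractional values later.
r1 <- cumsum(c(TRUE, diff(rank(my_strings)) < 0)) + 0
# r1: 1 2 2 3 3 3 3 4 4 5 5 5 6

# Step 2: runs shorter than 3 get distinct fractional offsets (0, 0.9);
# runs of 3 or more get seq(0, 0, ...), i.e. all zeros.
offset <- ave(r1, r1, FUN = function(g) {
  seq(0, 0.9 * (length(g) < 3), length.out = length(g))
})
r2 <- r1 + offset
# r2: 1 2 2.9 3 3 3 3 4 4.9 5 5 5 6

# Step 3: renumber consecutive runs of equal values (what rleid does).
rl <- rle(r2)$lengths
groups <- rep(seq_along(rl), times = rl)
groups
# [1] 1 2 3 4 4 4 4 5 6 7 7 7 8
```

The offsets only need to be distinct and smaller than 1, so that they never collide with the next run's id.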

Upvotes: 2

Alexlok

Reputation: 3134

Not nearly as good as r2evans' answer, but it also seems to give the expected result.

library(dplyr) # provides lead(), lag(), and tibble()

x <- my_strings
n <- length(x)
c(FALSE,x[-1L] > x[-n]) &
c(FALSE,FALSE,x[-1L][-1L] > x[-n][-(n-1)]) &
c(FALSE,FALSE,FALSE,x[-1L][-1L][-1L] > x[-n][-(n-1)][-(n-2)])

(lead(x, 1) > x & lead(x,2) > lead(x,1)) |
  (lag(x, 1) < x & lead(x,1) > x) |
  (lag(x, 1) < x & lag(x,2) < lag(x,1)) -> condition

condition[is.na(condition)] <- FALSE # remove NAs

#to visualize
tibble(lag(x,2), lag(x,1), x, lead(x,1), lead(x,2), condition)

# There may be a better way than a loop
cur_class <- 0
classes <- integer(n)
for(i in seq_len(n)){
  if(!condition[i]){ #not in a sequence
    cur_class <- cur_class + 1
    classes[i] <- cur_class
  } else if(i == 1 || !condition[i-1] || x[i] <= x[i-1]){ #first of a sequence (also when one sequence directly follows another)
    cur_class <- cur_class + 1
    classes[i] <- cur_class
  } else{ #mid-sequence
    classes[i] <- cur_class
  }
}

tibble(x, classes, condition*1L)

# A tibble: 13 x 3
#    x       classes `condition * 1L`
#    <chr>     <dbl>            <int>
#  1 xx1           1                0
#  2 1xxx          2                0
#  3 abc.xyz       3                0
#  4 a             4                1
#  5 ad022         4                1
#  6 ghj1          4                1
#  7 kf1           4                1
#  8 991r          5                0
#  9 jdd           6                0
# 10 12vd          7                1
# 11 r34o          7                1
# 12 z             7                1
# 13 034mh         8                0
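On the "better way than a loop" comment: one possible vectorized variant, sketched in base R so it needs no packages. The function name group_runs and the helper vectors are made up for this illustration. It rebuilds the same condition from shifted comparisons and additionally starts a new group when one sequence is immediately followed by another, which is an assumption about the intended behavior.

```r
# Hypothetical helper: group ids for runs of >= 3 strictly ascending elements.
group_runs <- function(x) {
  n <- length(x)
  asc  <- c(FALSE, x[-1] > x[-n])   # asc[i]:  x[i]   > x[i-1]
  nxt  <- c(asc[-1], FALSE)         # nxt[i]:  x[i+1] > x[i]
  nxt2 <- c(nxt[-1], FALSE)         # nxt2[i]: x[i+2] > x[i+1]
  prv2 <- c(FALSE, asc[-n])         # prv2[i]: x[i-1] > x[i-2]

  # TRUE when x[i] is the first, a middle, or the last element of a sequence
  condition <- (nxt & nxt2) | (asc & nxt) | (asc & prv2)

  # x[i] continues the previous group only if both it and its predecessor
  # are in a sequence and x[i] ascends from x[i-1]; otherwise a group starts.
  prev_cond <- c(FALSE, condition[-n])
  cumsum(!(condition & prev_cond & asc))
}

my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
                "jdd", "12vd", "r34o", "z", "034mh")
group_runs(my_strings)
# [1] 1 2 3 4 4 4 4 5 6 7 7 7 8
```

Because everything is a shifted vector comparison followed by one cumsum(), this should scale to the 500K elements mentioned in the question, though I have not benchmarked it.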

Upvotes: 1
