Reputation: 33
I have a character vector with about 500K elements, and I want to assign each element a number that shows which group it belongs to.
The grouping criteria are as follows:
How do I do this in R?
For example and expected output:
> my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
+ "jdd", "12vd", "r34o", "z", "034mh")
> expected_output <- c(1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 7, 7, 8)
> (df <- data.frame(input = my_strings, output = expected_output))
     input output
1      xx1      1
2     1xxx      2
3  abc.xyz      3
4        a      4
5    ad022      4
6     ghj1      4
7      kf1      4
8     991r      5
9      jdd      6
10    12vd      7
11    r34o      7
12       z      7
13   034mh      8
So far, I have tried using dplyr::lead to pair each element with the one after it and record the order of the two. I don't know how to proceed from here, though.
library(dplyr)
res <- as_tibble(my_strings) %>%
  mutate(after = lead(my_strings))
res$pre_group = apply(res, 1, function(x) order(c(x[1], x[2]))[2])
Upvotes: 3
Views: 115
Reputation: 160447
(Dang, this was a tough one :-)
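library(dplyr)
df %>%
  # start a new run whenever the rank of the next string drops
  mutate(r1 = cumsum(c(TRUE, diff(rank(input)) < 0)) + 0) %>%
  group_by(r1) %>%
  # in runs shorter than 3, add distinct fractional offsets so every element gets its own value
  mutate(r2 = r1 + seq(0, 0.9*(n() < 3), len = n()) / n()) %>%
  ungroup() %>%
  # renumber consecutive runs of equal r2 as 1, 2, 3, ...
  mutate(r1 = with(list(rl = rle(r2)$lengths), rep(seq_along(rl), times = rl))) %>%
  select(-r2)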
# # A tibble: 13 x 3
#    input   output    r1
#    <chr>    <dbl> <int>
#  1 xx1          1     1
#  2 1xxx         2     2
#  3 abc.xyz      3     3
#  4 a            4     4
#  5 ad022        4     4
#  6 ghj1         4     4
#  7 kf1          4     4
#  8 991r         5     5
#  9 jdd          6     6
# 10 12vd         7     7
# 11 r34o         7     7
# 12 z            7     7
# 13 034mh        8     8
(The lengthy with(...) in the mutate is just an inline version of data.table::rleid.)
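For readers who do not have data.table at hand, the run-numbering step can be sketched in base R. The helper name run_id is made up for this illustration, and the input vector is what I would expect the intermediate r2 column to look like for the example data (worked out by hand, so treat it as an assumption rather than captured output).

```r
# Hypothetical helper: number consecutive runs of equal values 1, 2, 3, ...
# (base R only; intended to mimic what data.table::rleid does for one vector)
run_id <- function(v) {
  lens <- rle(v)$lengths            # length of each run of equal values
  rep(seq_along(lens), times = lens)
}

# Assumed intermediate values: short runs carry a fractional offset,
# runs of three or more share one value
r2 <- c(1, 2, 2.45, 3, 3, 3, 3, 4, 4.45, 5, 5, 5, 6)
run_id(r2)
```

Each distinct consecutive value becomes its own group, so the two-element runs (2, 2.45 and 4, 4.45) are split while the longer runs stay together.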
data.table
library(data.table)
as.data.table(df)[
, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ][
, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ][
, r1 := rleid(r1) ]
If you want to blur the lines of R-dialects a little, then
library(data.table)
library(magrittr)
as.data.table(df) %>%
.[, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ] %>%
.[, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ] %>%
.[, r1 := rleid(r1) ]
Notes:
... + 0 is shorthand for as.numeric(...). This is because data.table enforces the column's original class when updating a column; since the first definition of r1 (without +0) would be integer, the next reassignment of r1 returns numeric. However, since data.table persists the original class, the numbers will be coerced (truncated) to integer and my efforts halted.
seq(0, 0.9*(...)) reduces to seq(0,0) when there are three or more in a group, which results in a no-op on that group. (This uses dplyr's n() and data.table's .N for group-size.)
The implementations differ slightly because dplyr prohibits modifying the grouping variable(s); data.table has no issue with this. (I'm not certain which direction is correct or better ...)
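A small base R sketch of the first two notes may make them easier to check. The helper name offsets is hypothetical; it mirrors the expression used in the dplyr version (including the division by the group size), with n standing in for n() or .N.

```r
# Hypothetical helper: the fractional offsets added within a group of size n
offsets <- function(n) seq(0, 0.9 * (n < 3), length.out = n) / n

offsets(1)   # a single element: one zero offset
offsets(2)   # two elements: distinct offsets, so they end up in different groups
offsets(4)   # three or more: all zeros, i.e. a no-op on that group

# Why the "+ 0" matters: if the column stayed integer, the offsets would be
# truncated away and the two elements would look identical again
as.integer(2 + offsets(2))
```

The last line shows the truncation the first note describes: both values collapse back to the same integer.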
Upvotes: 2
Reputation: 3134
Not nearly as good as r2evans' answer, but it also seems to give the expected result.
library(dplyr)  # provides lead(), lag() and tibble()
x <- my_strings
n <- length(x)
c(FALSE,x[-1L] > x[-n]) &
c(FALSE,FALSE,x[-1L][-1L] > x[-n][-(n-1)]) &
c(FALSE,FALSE,FALSE,x[-1L][-1L][-1L] > x[-n][-(n-1)][-(n-2)])
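# TRUE when x[i] is the first, middle, or last element of an ascending triple,
# i.e. it belongs to an ascending run of at least three elements
(lead(x, 1) > x & lead(x, 2) > lead(x, 1)) |
  (lag(x, 1) < x & lead(x, 1) > x) |
  (lag(x, 1) < x & lag(x, 2) < lag(x, 1)) -> condition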
condition[is.na(condition)] <- FALSE # remove NAs
#to visualize
tibble(lag(x,2), lag(x,1), x, lead(x,1), lead(x,2), condition)
# There may be a better way than a loop
cur_class <- 0
classes <- integer(n)
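for (i in seq_len(n)) {
  if (!condition[i]) {                      # not in a sequence
    cur_class <- cur_class + 1
    classes[i] <- cur_class
  } else if (i == 1 || !condition[i - 1] || x[i] <= x[i - 1]) {
    # first of a sequence; the i == 1 check avoids indexing condition[0],
    # and the last check catches a new run that starts right after another one
    cur_class <- cur_class + 1
    classes[i] <- cur_class
  } else {                                  # mid-sequence
    classes[i] <- cur_class
  }
}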
tibble(x, classes, condition*1L)
# A tibble: 13 x 3
#    x       classes `condition * 1L`
#    <chr>     <dbl>            <int>
#  1 xx1           1                0
#  2 1xxx          2                0
#  3 abc.xyz       3                0
#  4 a             4                1
#  5 ad022         4                1
#  6 ghj1          4                1
#  7 kf1           4                1
#  8 991r          5                0
#  9 jdd           6                0
# 10 12vd          7                1
# 11 r34o          7                1
# 12 z             7                1
# 13 034mh         8                0
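On the "better way than a loop" remark: one possible loop-free variant in base R is sketched below. This is not the method used above; it borrows the rank-based run detection from the other answer and replaces the loop with cumsum. The function name group_runs is made up for the example, and because it relies on diff(rank(x)) < 0, tied neighbours are treated as continuing a run, which differs from the strict comparisons in the loop version.

```r
# Hypothetical helper: group elements so that ascending runs of three or more
# share a group and every other element gets a group of its own
group_runs <- function(x) {
  # a new run starts whenever the rank drops from one element to the next
  run <- cumsum(c(TRUE, diff(rank(x)) < 0))
  # number of elements in the run each element belongs to
  size <- ave(seq_along(x), run, FUN = length)
  # an element opens a new group if it starts a run or sits in a short run
  starts_run <- c(TRUE, diff(run) != 0)
  cumsum(starts_run | size < 3)
}

my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
                "jdd", "12vd", "r34o", "z", "034mh")
group_runs(my_strings)
```

For the example data this should reproduce the expected grouping, though string ordering depends on the locale's collation, so results on other inputs may vary between systems.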
Upvotes: 1