Reputation: 932
I want to group rows of a df that have a certain column with values that are x% of each other. For example in the df below a 10% difference in values would be grouped into 3 groups: (A, C,F), (B,D), (E). So some kind of group by with a +/- 10% change in values.
tibble(Item = c("A","B","C","D","E","F"), value = c(1.01,2.42,1.03,2.45, 3.1, 0.99))
Upvotes: 1
Views: 167
Reputation: 26238
Since you have clarified in your comments that sequential elements will not clash each other, you can do something like this
library(dplyr)
df %>% arrange(value) %>%
group_by(grp = cumsum(lag(value, default = 0)*1.1 <= value)) %>%
ungroup() %>%
arrange(Item)
# A tibble: 6 x 3
Item value grp
<chr> <dbl> <int>
1 A 1.01 1
2 B 2.42 2
3 C 1.03 1
4 D 2.45 2
5 E 3.1 3
6 F 0.99 1
This will also produce expected results, if negative values may be there in value
df %>% arrange(value) %>%
group_by(grp = 1 + cumsum(lag(value, default = first(value))*1.1 <= value)) %>%
ungroup() %>%
arrange(Item)
# A tibble: 6 x 3
Item value grp
<chr> <dbl> <dbl>
1 A 1.01 1
2 B 2.42 2
3 C 1.03 1
4 D 2.45 2
5 E 3.1 3
6 F 0.99 1
Upvotes: 1
Reputation: 430
Something like this will work. The "group" is then Item.
You can see there will be a number of edge cases because of the way you've specified it. You can drop the argument mult = "first" to parse them out.
dt <- data.table(tibble(Item = c("A","B","C","D","E","F"), value = c(1.01,2.42,1.03,2.45, 3.1, 0.99)))
dt[, `:=`(lower_bound = value * .9,
upper_bound = value * 1.1)]
dt[dt, on = .(value > lower_bound,
value < upper_bound), mult = "first"][, .(i.Item), Item]
# Item i.Item
# 1: A A
# 2: A C
# 3: A F
# 4: B B
# 5: B D
# 6: E E
Upvotes: 1
Reputation: 28
As a quick and dirty solution I would suggest:
library(tidyverse)
df <- tibble(Item = c("A","B","C","D","E","F"), value = c(1.01,2.42,1.03,2.45, 3.1, 0.99))
df %>%
mutate(group = ceiling((value/max(value))/0.1))
where you can vary the 0.1
divisor.
Upvotes: 0