zimia
zimia

Reputation: 932

Group variables by values with % of each other

I want to group rows of a df that have a certain column with values that are x% of each other. For example in the df below a 10% difference in values would be grouped into 3 groups: (A, C,F), (B,D), (E). So some kind of group by with a +/- 10% change in values.

tibble(Item = c("A","B","C","D","E","F"), value = c(1.01,2.42,1.03,2.45, 3.1, 0.99))

Upvotes: 1

Views: 167

Answers (3)

AnilGoyal
AnilGoyal

Reputation: 26238

Since you have clarified in your comments that sequential elements will not clash each other, you can do something like this

library(dplyr)
df %>% arrange(value) %>% 
  group_by(grp = cumsum(lag(value, default = 0)*1.1 <= value)) %>%
  ungroup() %>%
  arrange(Item)

# A tibble: 6 x 3
  Item  value   grp
  <chr> <dbl> <int>
1 A      1.01     1
2 B      2.42     2
3 C      1.03     1
4 D      2.45     2
5 E      3.1      3
6 F      0.99     1

This will also produce expected results, if negative values may be there in value

df %>% arrange(value) %>% 
  group_by(grp = 1 + cumsum(lag(value, default = first(value))*1.1 <= value)) %>%
  ungroup() %>%
  arrange(Item)

# A tibble: 6 x 3
  Item  value   grp
  <chr> <dbl> <dbl>
1 A      1.01     1
2 B      2.42     2
3 C      1.03     1
4 D      2.45     2
5 E      3.1      3
6 F      0.99     1

Upvotes: 1

Danielle McCool
Danielle McCool

Reputation: 430

Something like this will work. The "group" is then Item.

You can see there will be a number of edge cases because of the way you've specified it. You can drop the argument mult = "first" to parse them out.

dt <- data.table(tibble(Item = c("A","B","C","D","E","F"), value = c(1.01,2.42,1.03,2.45, 3.1, 0.99)))
dt[, `:=`(lower_bound = value * .9,
          upper_bound = value * 1.1)]
dt[dt, on = .(value > lower_bound,
              value < upper_bound), mult = "first"][, .(i.Item), Item]

#    Item i.Item
# 1:    A      A
# 2:    A      C
# 3:    A      F
# 4:    B      B
# 5:    B      D
# 6:    E      E

Upvotes: 1

tlaus
tlaus

Reputation: 28

As a quick and dirty solution I would suggest:

library(tidyverse)

df <- tibble(Item = c("A","B","C","D","E","F"), value = c(1.01,2.42,1.03,2.45, 3.1, 0.99))

df %>%
  mutate(group = ceiling((value/max(value))/0.1))

where you can vary the 0.1 divisor.

Upvotes: 0

Related Questions