Ratnanil
Ratnanil

Reputation: 1752

Determine subgroup index

I have a large data frame with groups and subgroups. I would like to determine the index of the subgroup in each group, like shown in the OUTPUT column of the following data frame:

df <- data.frame(
  Group = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
  OUTPUT = c(1,1,2,2,2,1,1,2,2)
)

I've tried several possibilities with without any success. I'd like to work with dplyr, but I'm not sure how to go about this. The following code returns an unexpected result.

require(dplyr)

df <- df %>%
  group_by(Group) %>%
  mutate(
    OUTPUT_2 = dplyr::id(Subgroup)
  )

#df
#   Group Subgroup OUTPUT_2
#  (fctr)   (fctr)    (int)
#1      A        a        8
#2      A        a        8
#3      A        b        8
#4      A        b        8
#5      A        b        8
#6      B        a        4
#7      B        a        4
#8      B        b        4
#9      B        b        4

I've the feeling I'm close, but not getting there. Can anybody help?

Upvotes: 2

Views: 460

Answers (3)

Bulat
Bulat

Reputation: 6969

Here is a solution with data.table without aggregation:

dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)) , by = .(Group)]

This will be much faster compared to methods based on aggregation.

Upvotes: 3

akrun
akrun

Reputation: 886948

We can use the factor route with dplyr

library(dplyr)
df %>% 
    group_by(Group) %>%
    mutate(OUTPUT = as.numeric(factor(Subgroup, levels= unique(Subgroup))))
#   Group Subgroup OUTPUT
#  <fctr>   <fctr>  <dbl>
#1      A        a      1
#2      A        a      1
#3      A        b      2
#4      A        b      2
#5      A        b      2
#6      B        a      1
#7      B        a      1
#8      B        b      2
#9      B        b      2

Or another option is match with the unique elements of 'Subgroup' after grouping by 'Group'

df %>%
   group_by(Group) %>% 
   mutate(OUTPUT = match(Subgroup, unique(Subgroup)) )
#   Group Subgroup OUTPUT
#  <fctr>   <fctr>  <int>
#1      A        a      1
#2      A        a      1
#3      A        b      2
#4      A        b      2
#5      A        b      2
#6      B        a      1
#7      B        a      1
#8      B        b      2
#9      B        b      2

Upvotes: 2

eddi
eddi

Reputation: 49448

library(data.table)
dt = as.data.table(df) # or setDT to convert in place

unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
#   Group Subgroup idx OUTPUT
#1:     A        a   1      1
#2:     A        a   1      1
#3:     A        b   2      2
#4:     A        b   2      2
#5:     A        b   2      2
#6:     B        a   1      1
#7:     B        a   1      1
#8:     B        b   2      2
#9:     B        b   2      2

Translation to dplyr should be straightforward.


Another method, following the idea of using factors from aosmith's comment, is:

dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]

This will create a factor with correct levels per Group which is the indexing you're after.

Upvotes: 1

Related Questions