gnotnek
gnotnek

Reputation: 319

R: Renumber subgroups within groups, starting with 1 each time

What I have is a dataset "df" that has subgroups within groups, except the subgroups are numbered 1-N universally, rather than 1-n within each group. Currently it looks like this:

df = data.frame( c( rep(58, 10), rep(59, 12) ), 
                 c( rep(417, 4), rep(418, 5), rep(419, 1), rep(420, 7), 
                   rep(421, 5) ) )

colnames(df) = c("group", "subgroup_global")

df

     group      subgroup_global
1      58              417
2      58              417
3      58              417
4      58              417
5      58              418
6      58              418
7      58              418
8      58              418
9      58              418
10     58              419
11     59              420
12     59              420
13     59              420
14     59              420
15     59              420
16     59              420
17     59              420
18     59              421
19     59              421
20     59              421
21     59              421
22     59              421

I'd like to turn it into this, so the first subgroup within each group restarts at 1:

       group   subgroup_global subgroup
1      58              417         1
2      58              417         1
3      58              417         1
4      58              417         1
5      58              418         2
6      58              418         2
7      58              418         2
8      58              418         2
9      58              418         2
10     58              419         3
11     59              420         1
12     59              420         1
13     59              420         1
14     59              420         1
15     59              420         1
16     59              420         1
17     59              420         1
18     59              421         2
19     59              421         2
20     59              421         2
21     59              421         2
22     59              421         2

I have figured out a very cumbersome way to do this with essentially two nested loops:

renumber <- function(c) {
                c$subgroup <- 1
                x <- which(!duplicated(c$subgroup_global))
                c[x, "subgroup"] <- 1:length(x)
                for(i in x) {
                        val = c[i, "subgroup"]
                        valGlobal = c[i, "subgroup_global"]
                        rows = which(c$subgroup_global == as.numeric(valGlobal))
                        c[rows, "subgroup"] = val
                        }
                c
}

Next, I iterate this "renumber" function within each group of the dataset (you'll notice I sneak in the "filter" dplyr function among a sea of base R):

df_renumbered <- data.frame()
for(i in unique(df$group)) {
        df_partial <- filter(df, group == i)
        df_partial <- renumber(df_partial)
        df_renumbered <- rbind(df_renumbered, df_partial)
}
df_renumbered

Besides looking inefficient, this code takes (for all intents and purposes) an infinite amount of time to run on my real dataset, which contains 200,000+ rows, 3,000+ groups, and 27,000+ subgroups. Could you please provide a feasible solution using either dplyr or base R or some combo of the two?

Upvotes: 1

Views: 419

Answers (3)

Daniel Hoop
Daniel Hoop

Reputation: 692

EDIT: In case that as.numeric(as.factor(x)) does not work (for whatever reason I didn't figure out yet), use the following function to renumber the values.

renumber <- function(x){
  # Function inside for more clarity.
  replace.values <- function(search, replace, x){
    return(replace[ match(x, search) ])
  }
  # Replacement here.
  ux <- unique(x)
  return(replace.values(ux, 1:length(ux), x))
}

OLD answer: I know that this question was asked a long time ago. In the past I also used as.numeric(factor(x)) to solve the problem. However, at the moment (2016.12.21) it doesn't work anymore with my R version (3.3.1).

The new way to do it is to replace unique(subgroup_global) by values in the range 1:length(unique(subgroup_global)). I solved your problem below.

# Create your data
df = data.frame( c( rep(58, 10), rep(59, 12) ), 
                 c( rep(417, 4), rep(418, 5), rep(419, 1), rep(420, 7), 
                    rep(421, 5) ) )
colnames(df) = c("group", "subgroup_global")

# Define function to replace values.
replace.values <- function(search, replace, x){
  return(replace[ match(x, search) ])
}

# Use by... to loop through all groups in the data.frame
# See how function inside by() works by setting: x <- df[df[,"group"]==df[1,"group"],]
df <- do.call("rbind", by(df, df[,"group"], function(x){
  # replace unique(subgroup_global) by values in the range 1:length(unique(subgroup_global))
  ux <- unique(x[,"subgroup_global"])
  return(cbind(x, subgroup=replace.values(ux, 1:length(ux), x[,"subgroup_global"])))
  }))
rownames(df) <- NULL

# Print you new df
print(df)

   group subgroup_global subgroup
1     58             417        1
2     58             417        1
3     58             417        1
4     58             417        1
5     58             418        2
6     58             418        2
7     58             418        2
8     58             418        2
9     58             418        2
10    58             419        3
11    59             420        1
12    59             420        1
13    59             420        1
14    59             420        1
15    59             420        1
16    59             420        1
17    59             420        1
18    59             421        2
19    59             421        2
20    59             421        2
21    59             421        2
22    59             421        2

Upvotes: 0

jogo
jogo

Reputation: 12569

I prefer data.table

library(data.table)
setDT(df)
setkey(df, group, subgroup_global)
df[, subgroup:=as.factor(subgroup_global), by=group] # or
df[, subgroup:=as.numeric(as.factor(subgroup_global)), by=group]
# from data.table 1.9.6 on:
setDT(df)[, subgroup := frank(subgroup_global, ties.method = "dense"), by=group]
# (in this case setkey() above is not necessary)

Upvotes: 0

nicola
nicola

Reputation: 24520

You can use the base function ave:

ave(df$subgroup_global,df$group,FUN=function(x) as.numeric(factor(x)))

Upvotes: 1

Related Questions