Reputation: 319
What I have is a dataset "df" that has subgroups within groups, except the subgroups are numbered 1-N universally, rather than 1-n within each group. Currently it looks like this:
df = data.frame( c( rep(58, 10), rep(59, 12) ),
c( rep(417, 4), rep(418, 5), rep(419, 1), rep(420, 7),
rep(421, 5) ) )
colnames(df) = c("group", "subgroup_global")
df
group subgroup_global
1 58 417
2 58 417
3 58 417
4 58 417
5 58 418
6 58 418
7 58 418
8 58 418
9 58 418
10 58 419
11 59 420
12 59 420
13 59 420
14 59 420
15 59 420
16 59 420
17 59 420
18 59 421
19 59 421
20 59 421
21 59 421
22 59 421
I'd like to turn it into this, so the first subgroup within each group restarts at 1:
group subgroup_global subgroup
1 58 417 1
2 58 417 1
3 58 417 1
4 58 417 1
5 58 418 2
6 58 418 2
7 58 418 2
8 58 418 2
9 58 418 2
10 58 419 3
11 59 420 1
12 59 420 1
13 59 420 1
14 59 420 1
15 59 420 1
16 59 420 1
17 59 420 1
18 59 421 2
19 59 421 2
20 59 421 2
21 59 421 2
22 59 421 2
I have figured out a very cumbersome way to do this with essentially two nested loops:
renumber <- function(c) {
c$subgroup <- 1
x <- which(!duplicated(c$subgroup_global))
c[x, "subgroup"] <- 1:length(x)
for(i in x) {
val = c[i, "subgroup"]
valGlobal = c[i, "subgroup_global"]
rows = which(c$subgroup_global == as.numeric(valGlobal))
c[rows, "subgroup"] = val
}
c
}
Next, I iterate this "renumber" function within each group of the dataset (you'll notice I sneak in the "filter" dplyr function among a sea of base R):
df_renumbered <- data.frame()
for(i in unique(df$group)) {
df_partial <- filter(df, group == i)
df_partial <- renumber(df_partial)
df_renumbered <- rbind(df_renumbered, df_partial)
}
df_renumbered
Besides looking inefficient, this code takes (for all intents and purposes) an infinite amount of time to run on my real dataset, which contains 200,000+ rows, 3,000+ groups, and 27,000+ subgroups. Could you please provide a feasible solution using either dplyr or base R or some combo of the two?
Upvotes: 1
Views: 419
Reputation: 692
EDIT: In case that as.numeric(as.factor(x)) does not work (for whatever reason I didn't figure out yet), use the following function to renumber the values.
renumber <- function(x){
# Function inside for more clarity.
replace.values <- function(search, replace, x){
return(replace[ match(x, search) ])
}
# Replacement here.
ux <- unique(x)
return(replace.values(ux, 1:length(ux), x))
}
OLD answer: I know that this question was asked a long time ago. In the past I also used as.numeric(factor(x)) to solve the problem. However, at the moment (2016.12.21) it doesn't work anymore with my R version (3.3.1).
The new way to do it is to replace unique(subgroup_global) by values in the range 1:length(unique(subgroup_global)). I solved your problem below.
# Create your data
df = data.frame( c( rep(58, 10), rep(59, 12) ),
c( rep(417, 4), rep(418, 5), rep(419, 1), rep(420, 7),
rep(421, 5) ) )
colnames(df) = c("group", "subgroup_global")
# Define function to replace values.
replace.values <- function(search, replace, x){
return(replace[ match(x, search) ])
}
# Use by... to loop through all groups in the data.frame
# See how function inside by() works by setting: x <- df[df[,"group"]==df[1,"group"],]
df <- do.call("rbind", by(df, df[,"group"], function(x){
# replace unique(subgroup_global) by values in the range 1:length(unique(subgroup_global))
ux <- unique(x[,"subgroup_global"])
return(cbind(x, subgroup=replace.values(ux, 1:length(ux), x[,"subgroup_global"])))
}))
rownames(df) <- NULL
# Print you new df
print(df)
group subgroup_global subgroup
1 58 417 1
2 58 417 1
3 58 417 1
4 58 417 1
5 58 418 2
6 58 418 2
7 58 418 2
8 58 418 2
9 58 418 2
10 58 419 3
11 59 420 1
12 59 420 1
13 59 420 1
14 59 420 1
15 59 420 1
16 59 420 1
17 59 420 1
18 59 421 2
19 59 421 2
20 59 421 2
21 59 421 2
22 59 421 2
Upvotes: 0
Reputation: 12569
I prefer data.table
library(data.table)
setDT(df)
setkey(df, group, subgroup_global)
df[, subgroup:=as.factor(subgroup_global), by=group] # or
df[, subgroup:=as.numeric(as.factor(subgroup_global)), by=group]
# from data.table 1.9.6 on:
setDT(df)[, subgroup := frank(subgroup_global, ties.method = "dense"), by=group]
# (in this case setkey() above is not necessary)
Upvotes: 0
Reputation: 24520
You can use the base
function ave
:
ave(df$subgroup_global,df$group,FUN=function(x) as.numeric(factor(x)))
Upvotes: 1