Reputation: 11
I have a big data frame and I would like to categorize the age within the same group according to a condition. Here is an example:
Let's say I have the following data frame:
HHID Sex Age
112 1 14
112 1 15
112 2 17
112 2 19
113 2 50
113 2 51
113 2 51
For each (hhid + sex) group, I would like to create a agegrp variable based on the following condition: if there is more than 2 years age difference between each row, categorize them in different groups. Otherwise put them in the same category. Here is what I want:
HHID Sex Age Agegrp
112 1 14 14
112 1 15 14
112 2 17 17
112 2 19 17
113 2 50 50
113 2 51 50
113 2 51 50
The data set I have is really large compared to this example. Hence, I am looking for a general solution.
Upvotes: 1
Views: 453
Reputation: 11
The organize function suggested above by is a good idea. However, cumsum function cannot correctly locate the position of the age if the same age is repeated say for 15 years and change to some other age whose difference with the previous one is more than 2. I wrote the following function to solve this issue:
organize = function(v, threshold=2) v[rep.int(which(c(TRUE, diff(v) > threshold)==TRUE), c(diff(which(c(TRUE, diff(v) > threshold)==TRUE)),1))]
This function give me the desired output with a warning. The warning is "number of items to replace is not a multiple of replacement length". Now working on getting rid of the warning and any help is appreciated.
Upvotes: 0
Reputation: 28441
I would write a quick organize
function, then choose base R
, dplyr
or data.table
to implement it. Whichever you prefer:
organize <- function(v, threshold=2) v[cumsum(c(TRUE, diff(v) > threshold))]
#base (no packages)
df1$Agegrp <- with(df1, ave(Age, HHID, Sex, FUN=organize))
#dplyr
library(dplyr)
df1 %>% group_by(HHID, Sex) %>%
mutate(Agegrp = organize(Age))
#data.table
library(data.table)
setDT(df1)[, organize(Age), by=.(HHID, Sex)]
Upvotes: 2