Dilsat
Dilsat

Reputation: 11

Categorizing Age According to a Condition in R

I have a big data frame and I would like to categorize the age within the same group according to a condition. Here is an example:

Let's say I have the following data frame:

    HHID  Sex  Age 
    112   1    14
    112   1    15
    112   2    17
    112   2    19
    113   2    50
    113   2    51
    113   2    51

For each (hhid + sex) group, I would like to create a agegrp variable based on the following condition: if there is more than 2 years age difference between each row, categorize them in different groups. Otherwise put them in the same category. Here is what I want:

    HHID  Sex  Age Agegrp
    112   1    14   14
    112   1    15   14
    112   2    17   17
    112   2    19   17
    113   2    50   50
    113   2    51   50
    113   2    51   50

The data set I have is really large compared to this example. Hence, I am looking for a general solution.

Upvotes: 1

Views: 453

Answers (2)

Dilsat
Dilsat

Reputation: 11

The organize function suggested above by is a good idea. However, cumsum function cannot correctly locate the position of the age if the same age is repeated say for 15 years and change to some other age whose difference with the previous one is more than 2. I wrote the following function to solve this issue:

organize = function(v, threshold=2) v[rep.int(which(c(TRUE, diff(v) > threshold)==TRUE), c(diff(which(c(TRUE, diff(v) > threshold)==TRUE)),1))] 

This function give me the desired output with a warning. The warning is "number of items to replace is not a multiple of replacement length". Now working on getting rid of the warning and any help is appreciated.

Upvotes: 0

Pierre L
Pierre L

Reputation: 28441

I would write a quick organize function, then choose base R, dplyr or data.table to implement it. Whichever you prefer:

organize <- function(v, threshold=2) v[cumsum(c(TRUE, diff(v) > threshold))]

#base (no packages)
df1$Agegrp <- with(df1, ave(Age, HHID, Sex, FUN=organize))

#dplyr
library(dplyr)
df1 %>% group_by(HHID, Sex) %>%
  mutate(Agegrp = organize(Age))

#data.table
library(data.table)
setDT(df1)[, organize(Age), by=.(HHID, Sex)]

Upvotes: 2

Related Questions