Creating new column based on row values of multiple data subsetting conditions

Question

I have a dataframe that looks more or less as follows (the original one has 12 years of data):

   Year   Quarter   Age_1   Age_2   Age_3   Age_4
   2005      1       158     120     665     32
   2005      2       257     145     121     14
   2005      3       68       69     336     65
   2005      4       112     458     370     101
   2006      1       75      457     741     26
   2006      2       365     134     223     45
   2006      3       257     121     654     341
   2006      4       175     124     454     12
   2007      1       697     554     217     47
   2007      2       954     987     118     54
   2007      4       498     235     112     65

The numbers in the age columns represent the number of individuals in each age class for a specific quarter within a specific year. Note that not every quarter in a given year has data (e.g., the third quarter is missing in 2007). Each row represents a sampling event. Although not shown in this example, the original dataset always has more than one sampling event for a specific quarter within a specific year. For example, the first quarter of 2005 has 47 sampling events and therefore 47 rows.

What I'd like to have now is a dataframe structured like this:

       Year   Quarter   Age_1   Age_2   Age_3   Age_4    Cohort
       2005      1       158     120     665     32        158
       2005      2       257     145     121     14        257
       2005      3       68       69     336     65         68
       2005      4       112     458     370     101       112
       2006      1       75      457     741     26        457 
       2006      2       365     134     223     45        134
       2006      3       257     121     654     341       121
       2006      4       175     124     454     12        124
       2007      1       697     554     217     47        217
       2007      2       954     987     118     54        118
       2007      4       498     235     112     65        112

In this case, I want to create a new column (Cohort) in my original dataset that follows a cohort through the years. In other words, for my first year of data (2005, with all quarters), I take the row values of Age_1 and paste them into the new column. When I move to the next year (2006), I take the row values of Age_2 and paste them into the new column, and so on.

I have tried to use the following function, but somehow it only works for the first couple of years:

extract_cohort_quarter <- function(d, yearclass=2005, quarterclass=1) {

  ny <- 1:nlevels(d$Year) # no. of Year levels in the dataset
  nq <- 1:nlevels(d$Quarter)
  age0 <- (paste("age", ny, sep="_"))
  year0 <- as.character(yearclass + ny - 1)

  quarter <- as.character(rep(1:4, length(age0)))
  age <- rep(age0, each=4)
  year <- rep(year0, each=4)

  df <- data.frame(year, age, quarter, stringsAsFactors=FALSE)

  n <- nrow(df)
  dnew <- NULL
  for(i in 1:n) {
    tmp <- subset(d, Year==df$year[i] & Quarter==df$quarter[i])
    tmp$Cohort <- tmp[[age[i]]]
    dnew <- rbind(dnew, tmp)
  }
  levels(dnew$Year) <- paste("Yearclass_", yearclass, ":",
                             year, ":", quarter, ":", age, sep="")
  dnew
}

I have plenty of data from age_1 to age_12 for all the years and quarters, so I don't think the problem is related to the data structure itself.

Is there an easier way to solve this problem? Or is there a way to improve my extract_cohort_quarter() function? Any help will be much appreciated.

-M

denis · Accepted Answer

I have a simple solution, but it requires a bit of knowledge of the data.table library. I think you can easily adapt it to your further needs. Here is the data:

library(data.table)

DT <- as.data.table(list(Year    = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007),
                         Quarter = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 4),
                         Age_1   = c(158, 257, 68, 112, 75, 365, 257, 175, 697, 954, 498),
                         Age_2   = c(120, 145, 69, 458, 457, 134, 121, 124, 554, 987, 235),
                         Age_3   = c(665, 121, 336, 370, 741, 223, 654, 454, 217, 118, 112),
                         Age_4   = c(32, 14, 65, 101, 26, 45, 341, 12, 47, 54, 65)
))

Here is the code:

DT[,index := .GRP, by = Year]
DT[,cohort := get(paste0("Age_",index)),by = Year]

and the output:

> DT
    Year Quarter Age_1 Age_2 Age_3 Age_4 index cohort
 1: 2005       1   158   120   665    32     1    158
 2: 2005       2   257   145   121    14     1    257
 3: 2005       3    68    69   336    65     1     68
 4: 2005       4   112   458   370   101     1    112
 5: 2006       1    75   457   741    26     2    457
 6: 2006       2   365   134   223    45     2    134
 7: 2006       3   257   121   654   341     2    121
 8: 2006       4   175   124   454    12     2    124
 9: 2007       1   697   554   217    47     3    217
10: 2007       2   954   987   118    54     3    118
11: 2007       4   498   235   112    65     3    112

What it does:

DT[,index := .GRP, by = Year]

creates an index for each distinct year in your table: by = Year runs the operation once per group of rows sharing the same year, and .GRP is the group counter (1 for the first year in the table, 2 for the second, and so on). I then use that number to pick the matching Age_ column:

DT[,cohort := get(paste0("Age_",index)),by = Year]
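To make the two steps easier to follow, here is a minimal sketch on a made-up table (`toy` and its values are hypothetical, not the asker's data). It only illustrates that .GRP is one integer per group and that get() turns the pasted name into the column itself:

```r
library(data.table)

# Hypothetical toy table, used only for illustration
toy <- data.table(Year  = c(2005, 2005, 2006, 2007),
                  Age_1 = c(10, 11, 12, 13),
                  Age_2 = c(20, 21, 22, 23),
                  Age_3 = c(30, 31, 32, 33))

# .GRP is a single integer per group: 1 for the first Year, 2 for the second, ...
grp <- toy[, .(index = .GRP), by = Year]

# Within each Year group, paste0() builds one column name ("Age_1", "Age_2", ...)
# and get() returns the values of that column for the rows of the group
toy[, cohort := get(paste0("Age_", .GRP)), by = Year]
```

Here the 2005 rows take their values from Age_1, the 2006 row from Age_2 and the 2007 row from Age_3.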

You can even do everything in a single line:

DT[,cohort := get(paste0("Age_",.GRP)),by = Year]
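One caveat worth keeping in mind: .GRP counts groups in their order of appearance, so it matches the age class only if the table is sorted by year and no year is missing entirely. A possible variant, sketched here under the assumption that the cohort's year class is the earliest year in the table, derives the age class from the calendar year instead (`DT2` and `first_year` are hypothetical names, and the data are made up):

```r
library(data.table)

# Hypothetical data with a gap: 2006 is missing entirely
DT2 <- data.table(Year  = c(2005, 2005, 2007, 2007),
                  Age_1 = c(1, 2, 3, 4),
                  Age_2 = c(10, 20, 30, 40),
                  Age_3 = c(100, 200, 300, 400))

# Assumption: the cohort being followed starts in the first year of the data
first_year <- min(DT2$Year)

# The age class comes from the calendar year, not from the group counter,
# so 2007 maps to Age_3 even though it is only the second group
DT2[, cohort := get(paste0("Age_", Year - first_year + 1)), by = Year]
```

With this sketch the 2007 rows pick up Age_3, whereas .GRP would have selected Age_2. It will still fail if the computed column (for example Age_13) does not exist in the table.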

I hope it helps
