Conditional creation of column with increment in data.table

Question

I have a large data set with around 1 million records and 80 columns. To speed up processing, I am using data.table. I need to create a new column based on a condition and I am lost as to how to do this in data.table

Below is the code for sample data:

set.seed(1200)
N_Blocks = 1348
cyc=200
City1 <- vector()
City2 <- vector()
a1 <- vector()
a2 <- vector()

for (a in 1:cyc) {
City1 <- sample(paste("City", formatC(a, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)
a1 <- sample(0:1,N_Blocks,rep = T)

City2 <- append(City2,City1)
a2 <- append(a2,a1)
}

df1 <- data.frame(City2,a2)

Now the requirement is that for each city (currently we are having 200 cities in this sample data) and for a2 == 1 I need to create a new column that will have the total number of 1s divided in 12 months. So for example City001 & a2 == 1 with the seed of 1200 I get 671 records. So the new column Months needs to have codes 01-12. So the 1st 56 records where a2 == 1 will have code 01 then next 56 records will have code 02 and so on.....and the last 55 records of City001 with a2 == 1 will have code 12 (so the total adds to 671). Something like splitting the selection of a2 for every City in 12 months.

We can get the City level summary of selection from the command -

table(df1$City2,df1$a2)

Can we achieve this using data.table?

Jaap · Accepted Answer

Using:

library(data.table)
setDT(df1)[a2 == 1, mon := cut(seq_along(a2), 12, sprintf('%02d',1:12)), by = City2][]

gives:

          City2 a2 mon
     1: City001  1  01
     2: City001  1  01
     3: City001  1  01
     4: City001  1  01
     5: City001  0  NA
    ---               
269596: City200  1  12
269597: City200  0  NA
269598: City200  1  12
269599: City200  1  12
269600: City200  1  12

What this does:

With setDT(df1) the data.frame is converted to a data.table (which still also a data.frame).
The data is filtered with a2 == 1 and grouped with by = City2.
Finally, a new column mon is created by reference for the selected rows with cut.
For each group of the City2-column, cut divides a sequence (seq_along(a2); this starts with 1 and ends with the groupsize, i.e. 671 for the first group) in 12 (almost) equal parts. Each part gets a corresponding label (sprintf('%02d',1:12)). See ?cut for more details.

The drawback of the above method is that the break with the lowest number of records isn't at the end. Which can be shown by summarising the result:

> df1[a2 == 1, .N, by = .(City2,mon)][1:24]
      City2 mon  N
 1: City001  01 56
 2: City001  02 56
 3: City001  03 56
 4: City001  04 56
 5: City001  05 56
 6: City001  06 56
 7: City001  07 55
 8: City001  08 56
 9: City001  09 56
10: City001  10 56
11: City001  11 56
12: City001  12 56
13: City002  01 56
14: City002  02 55
15: City002  03 56
16: City002  04 55
17: City002  05 56
18: City002  06 55
19: City002  07 55
20: City002  08 56
21: City002  09 55
22: City002  10 56
23: City002  11 55
24: City002  12 56

To get the breaks with the lower number of records at the end, you can use:

setDT(df1)[a2 == 1, mon := cut(seq_along(a2),
                               {n <- .N/12; 
                                br <- c(0, rep(ceiling(n), round((n-floor(n))*12)), rep(floor(n), round((floor(n) - n + 1)*12)));
                                unique(c(cumsum(br),.N))},
                               sprintf('%02d',1:12)),
           by = City2][]

The number of breaks (12) from the first solution is replaced by a vector of breakpoints which is calculated for each group of City2 separately in the part between the curly brackets ({n <- .N/12 .... unique(c(cumsum(br),.N))}). This part calculates the breaks in such a way that the number of observations aren't randomly distributed over the months, but such that the months with the lower number of observations will always be at the end.

In addition to the explanantion above, the part between the curly brackets does the following:

First the number of observations in each group (.N) is divided by 12. Often this won't be a integer number but a numeric with values after the decimal point.
rep(ceiling(n), round((n-floor(n))*12)) calculates the higher number of observations for the first groups (ceiling(n)) and repeats that a certain number of times which is determined by the decimalvalue of n with round((n-floor(n))*12).
The groupsize of smaller groups (breaks) is determined by rep(floor(n), round((floor(n) - n + 1)*12)) is a similar way. floor(n) determines the size of the group, while round((floor(n) - n + 1)*12) determine the number of groups that is needed for the lower groupsize.
The previous two vectors are put together with a starting zero (c(0, rep ... )).
By using cumsum on that, you get a vector starting with 0 to which the groupsize is added at the end (.N). By wrapping this in unique you make sure you get unique break values. This is needed because when n is a prefectly rounded number (e.g. the outcome of 600/12), the last break-value appears two times in the vector.

The same check now shows that this is achieved:

> df1[a2 == 1, .N, by = .(City2,mon)][1:24]
      City2 mon  N
 1: City001  01 56
 2: City001  02 56
 3: City001  03 56
 4: City001  04 56
 5: City001  05 56
 6: City001  06 56
 7: City001  07 56
 8: City001  08 56
 9: City001  09 56
10: City001  10 56
11: City001  11 56
12: City001  12 55
13: City002  01 56
14: City002  02 56
15: City002  03 56
16: City002  04 56
17: City002  05 56
18: City002  06 56
19: City002  07 55
20: City002  08 55
21: City002  09 55
22: City002  10 55
23: City002  11 55
24: City002  12 55

Conditional creation of column with increment in data.table

Answers (1)

Related Questions