Reputation: 729
I have a large data set with around 1 million records and 80 columns. To speed up processing, I am using data.table. I need to create a new column based on a condition and I am lost as to how to do this in data.table
Below is the code for sample data:
set.seed(1200)
N_Blocks = 1348
cyc=200
City1 <- vector()
City2 <- vector()
a1 <- vector()
a2 <- vector()
for (a in 1:cyc) {
City1 <- sample(paste("City", formatC(a, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)
a1 <- sample(0:1,N_Blocks,rep = T)
City2 <- append(City2,City1)
a2 <- append(a2,a1)
}
df1 <- data.frame(City2,a2)
Now the requirement is that for each city (currently we are having 200 cities in this sample data) and for a2 == 1 I need to create a new column that will have the total number of 1s divided in 12 months. So for example City001 & a2 == 1 with the seed of 1200 I get 671 records. So the new column Months needs to have codes 01-12. So the 1st 56 records where a2 == 1 will have code 01 then next 56 records will have code 02 and so on.....and the last 55 records of City001 with a2 == 1 will have code 12 (so the total adds to 671). Something like splitting the selection of a2 for every City in 12 months.
We can get the City level summary of selection from the command -
table(df1$City2,df1$a2)
Can we achieve this using data.table?
Upvotes: 3
Views: 1188
Reputation: 83225
Using:
library(data.table)
setDT(df1)[a2 == 1, mon := cut(seq_along(a2), 12, sprintf('%02d',1:12)), by = City2][]
gives:
City2 a2 mon 1: City001 1 01 2: City001 1 01 3: City001 1 01 4: City001 1 01 5: City001 0 NA --- 269596: City200 1 12 269597: City200 0 NA 269598: City200 1 12 269599: City200 1 12 269600: City200 1 12
What this does:
setDT(df1)
the data.frame is converted to a data.table (which still also a data.frame).a2 == 1
and grouped with by = City2
.mon
is created by reference for the selected rows with cut
.City2
-column, cut
divides a sequence (seq_along(a2)
; this starts with 1
and ends with the groupsize, i.e. 671
for the first group) in 12 (almost) equal parts. Each part gets a corresponding label (sprintf('%02d',1:12)
). See ?cut
for more details.The drawback of the above method is that the break with the lowest number of records isn't at the end. Which can be shown by summarising the result:
> df1[a2 == 1, .N, by = .(City2,mon)][1:24] City2 mon N 1: City001 01 56 2: City001 02 56 3: City001 03 56 4: City001 04 56 5: City001 05 56 6: City001 06 56 7: City001 07 55 8: City001 08 56 9: City001 09 56 10: City001 10 56 11: City001 11 56 12: City001 12 56 13: City002 01 56 14: City002 02 55 15: City002 03 56 16: City002 04 55 17: City002 05 56 18: City002 06 55 19: City002 07 55 20: City002 08 56 21: City002 09 55 22: City002 10 56 23: City002 11 55 24: City002 12 56
To get the breaks with the lower number of records at the end, you can use:
setDT(df1)[a2 == 1, mon := cut(seq_along(a2),
{n <- .N/12;
br <- c(0, rep(ceiling(n), round((n-floor(n))*12)), rep(floor(n), round((floor(n) - n + 1)*12)));
unique(c(cumsum(br),.N))},
sprintf('%02d',1:12)),
by = City2][]
The number of breaks (12
) from the first solution is replaced by a vector of breakpoints which is calculated for each group of City2
separately in the part between the curly brackets ({n <- .N/12 .... unique(c(cumsum(br),.N))}
). This part calculates the breaks in such a way that the number of observations aren't randomly distributed over the months, but such that the months with the lower number of observations will always be at the end.
In addition to the explanantion above, the part between the curly brackets does the following:
.N
) is divided by 12
. Often this won't be a integer number but a numeric with values after the decimal point.rep(ceiling(n), round((n-floor(n))*12))
calculates the higher number of observations for the first groups (ceiling(n)
) and repeats that a certain number of times which is determined by the decimalvalue of n
with round((n-floor(n))*12)
.rep(floor(n), round((floor(n) - n + 1)*12))
is a similar way. floor(n)
determines the size of the group, while round((floor(n) - n + 1)*12)
determine the number of groups that is needed for the lower groupsize.c(0, rep ... )
).0
to which the groupsize is added at the end (.N
). By wrapping this in unique
you make sure you get unique break values. This is needed because when n
is a prefectly rounded number (e.g. the outcome of 600/12), the last break-value appears two times in the vector.The same check now shows that this is achieved:
> df1[a2 == 1, .N, by = .(City2,mon)][1:24] City2 mon N 1: City001 01 56 2: City001 02 56 3: City001 03 56 4: City001 04 56 5: City001 05 56 6: City001 06 56 7: City001 07 56 8: City001 08 56 9: City001 09 56 10: City001 10 56 11: City001 11 56 12: City001 12 55 13: City002 01 56 14: City002 02 56 15: City002 03 56 16: City002 04 56 17: City002 05 56 18: City002 06 56 19: City002 07 55 20: City002 08 55 21: City002 09 55 22: City002 10 55 23: City002 11 55 24: City002 12 55
Upvotes: 2