Reputation: 63
I have a table table as below: df
KEY CAT DATE AMOUNT VAR VALUE
1 26 2015/10/1 1400 Event.Budget_Cat26 NA
1 26 2015/10/1 300 Event.Budget_Cat26 NA
1 26 2015/10/1 NA NA NA
1 22 2015/10/1 100 Event.Budget_Cat22 NA
1 22 2015/10/1 300 Event.Budget_Cat22 NA
1 22 2015/10/1 NA NA NA
2 21 2014/1/1 200 Event.Budget_Cat21 NA
2 21 2014/1/1 NA NA NA
This is just part of the large table. Basically, each row is unique by (KEY,CAT, DATE). I want to find a way to figure out the sum of amount based on the unique identify and put the final sum into VALUE column. Also, give a name to the VAR. The AMOUNT columns exist just for the VALUE columns and I will delete the whole columns in the end, so keep it as NA is okay.
The final result will be something like
KEY CAT DATE AMOUNT VAR VALUE
1 26 2015/10/1 1400 Event.Budget_Cat26 NA
1 26 2015/10/1 300 Event.Budget_Cat26 NA
1 26 2015/10/1 NA Scalar.Budget_Cat26_Amt 1700
1 22 2015/10/1 100 Event.Budget_Cat22 NA
1 22 2015/10/1 300 Event.Budget_Cat22 NA
1 22 2015/10/1 NA Scalar.Budget_Cat22_Amt 400
2 21 2014/1/1 200 Event.Budget_Cat21 NA
2 21 2014/1/1 NA Scalar.Budget_Cat21_Amt 200
I have try the rbind and lapply, but the result is not what I want
df[, 4:6 := lapply(.SD,
function(x) replace(x, is.na(x), c("",paste("Scalar_Budget_Cat",CAT,"_Amt",sep =""),sum(x, na.rm=TRUE))),
KEY, .SDcols=4]
Could anyone help me to figure out a fast way to run through the whole data table, which contains 2,600,000 rows? Thank you
Upvotes: 0
Views: 1527
Reputation: 23909
I tried it on 2.6 million rows and it is really fast. We split the task in two operations. The first one is to group the rows by KEY
, CAT
and DATE
and then sum over the AMOUNT
column. After that we select the rows where is.na(VAR) == T
and fill the VAR
column with the corresponding string.
df[, VALUE:=sum(AMOUNT, na.rm = T),by=list(KEY, CAT, DATE)][is.na(VAR), VAR:=paste0("Scalar_Budget_Cat",CAT,"_Amt"),]
The difference to your desired output is, that there are no more NA
values in VALUE
. But that can be set back afterwards using df[!is.na(AMOUNT), VALUE:=NA,]
Upvotes: 1
Reputation: 887541
There are ways to create 'VALUE' with the sum
of 'AMOUNT' after grouping by 'KEY', 'CAT', 'DATE' and later changing those values to NA
based on whether it is not the last value for group. Instead of doing that, we use rep
to create NA
and sum
of 'AMOUNT' for the 'VALUE' column and modifying the 'VAR' column to have the last element for each group with `Scalar' substring
setDT(df1)[, c("VAR", "VALUE") := .(c(VAR[-.N], paste0("Scalar.",
sub("^[^.]+\\.", "", VAR[1]), "_Amt")), rep(c(NA, sum(AMOUNT, na.rm = TRUE)),
c(.N-1, 1))), .(KEY, CAT, DATE)]
df1
# KEY CAT DATE AMOUNT VAR VALUE
#1: 1 26 2015/10/1 1400 Event.Budget_Cat26 NA
#2: 1 26 2015/10/1 300 Event.Budget_Cat26 NA
#3: 1 26 2015/10/1 NA Scalar.Budget_Cat26_Amt 1700
#4: 1 22 2015/10/1 100 Event.Budget_Cat22 NA
#5: 1 22 2015/10/1 300 Event.Budget_Cat22 NA
#6: 1 22 2015/10/1 NA Scalar.Budget_Cat22_Amt 400
#7: 2 21 2014/1/1 200 Event.Budget_Cat21 NA
#8: 2 21 2014/1/1 NA Scalar.Budget_Cat21_Amt 200
Upvotes: 1
Reputation: 2826
Here's a solution with tidyverse
:
library(tidyverse)
KEY <- c(1, 1, 1, 1, 1, 1, 2, 2)
CAT <- c(26, 26, 26, 22, 22, 22, 21,21)
DATE <- c('2015/10/1', '2015/10/1', '2015/10/1', '2015/10/1', '2015/10/1', '2015/10/1', '2014/1/1', '2014/1/1')
AMOUNT <- c(1400, 300, NA, 100, 300, NA, 200, NA)
VAR <- c('Event.Budget_Cat26', 'Event.Budget_Cat26', NA, 'Event.Budget_Cat22', 'Event.Budget_Cat22', NA, 'Event.Budget_Cat21', NA)
df <- data_frame(KEY, CAT, DATE, AMOUNT, VAR, VALUE = NA)
summary_rows <- df %>%
group_by(KEY, CAT, DATE) %>%
summarise(VAR = paste0('Scalar.Budget_Cat', max(CAT), '_Amt'),
VALUE = sum(AMOUNT, na.rm = T),
AMOUNT = NA) %>%
select(KEY, CAT, DATE, AMOUNT, VAR, VALUE) %>%
arrange(KEY, -CAT, DATE)
df[is.na(AMOUNT),] <- summary_rows
df
## KEY CAT DATE AMOUNT VAR VALUE
## <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 26 2015/10/1 1400 Event.Budget_Cat26 NA
## 2 1 26 2015/10/1 300 Event.Budget_Cat26 NA
## 3 1 26 2015/10/1 NA Scalar.Budget_Cat26_Amt 1700
## 4 1 22 2015/10/1 100 Event.Budget_Cat22 NA
## 5 1 22 2015/10/1 300 Event.Budget_Cat22 NA
## 6 1 22 2015/10/1 NA Scalar.Budget_Cat22_Amt 400
## 7 2 21 2014/1/1 200 Event.Budget_Cat21 NA
## 8 2 21 2014/1/1 NA Scalar.Budget_Cat21_Amt 200
Upvotes: 0