VeraShao
VeraShao

Reputation: 63

conditional fill in values into specific row in a data table with R

I have a table table as below: df

KEY    CAT      DATE   AMOUNT     VAR            VALUE
1      26    2015/10/1  1400 Event.Budget_Cat26   NA
1      26    2015/10/1  300  Event.Budget_Cat26   NA
1      26    2015/10/1  NA        NA              NA
1      22    2015/10/1  100  Event.Budget_Cat22   NA
1      22    2015/10/1  300  Event.Budget_Cat22   NA
1      22    2015/10/1  NA        NA              NA
2      21    2014/1/1   200  Event.Budget_Cat21   NA
2      21    2014/1/1   NA        NA              NA

This is just part of the large table. Basically, each row is unique by (KEY,CAT, DATE). I want to find a way to figure out the sum of amount based on the unique identify and put the final sum into VALUE column. Also, give a name to the VAR. The AMOUNT columns exist just for the VALUE columns and I will delete the whole columns in the end, so keep it as NA is okay.

The final result will be something like

KEY    CAT      DATE   AMOUNT     VAR                 VALUE
1      26    2015/10/1  1400 Event.Budget_Cat26        NA
1      26    2015/10/1  300  Event.Budget_Cat26        NA
1      26    2015/10/1  NA   Scalar.Budget_Cat26_Amt  1700
1      22    2015/10/1  100  Event.Budget_Cat22        NA
1      22    2015/10/1  300  Event.Budget_Cat22        NA
1      22    2015/10/1  NA   Scalar.Budget_Cat22_Amt   400
2      21    2014/1/1   200  Event.Budget_Cat21        NA
2      21    2014/1/1   NA   Scalar.Budget_Cat21_Amt   200

I have try the rbind and lapply, but the result is not what I want

df[, 4:6 := lapply(.SD, 
       function(x) replace(x, is.na(x), c("",paste("Scalar_Budget_Cat",CAT,"_Amt",sep =""),sum(x, na.rm=TRUE))), 
             KEY, .SDcols=4]

Could anyone help me to figure out a fast way to run through the whole data table, which contains 2,600,000 rows? Thank you

Upvotes: 0

Views: 1527

Answers (3)

Martin Schmelzer
Martin Schmelzer

Reputation: 23909

I tried it on 2.6 million rows and it is really fast. We split the task in two operations. The first one is to group the rows by KEY, CAT and DATE and then sum over the AMOUNT column. After that we select the rows where is.na(VAR) == T and fill the VAR column with the corresponding string.

df[, VALUE:=sum(AMOUNT, na.rm = T),by=list(KEY, CAT, DATE)][is.na(VAR), VAR:=paste0("Scalar_Budget_Cat",CAT,"_Amt"),]

The difference to your desired output is, that there are no more NA values in VALUE. But that can be set back afterwards using df[!is.na(AMOUNT), VALUE:=NA,]

Upvotes: 1

akrun
akrun

Reputation: 887541

There are ways to create 'VALUE' with the sum of 'AMOUNT' after grouping by 'KEY', 'CAT', 'DATE' and later changing those values to NA based on whether it is not the last value for group. Instead of doing that, we use rep to create NA and sum of 'AMOUNT' for the 'VALUE' column and modifying the 'VAR' column to have the last element for each group with `Scalar' substring

setDT(df1)[, c("VAR", "VALUE") := .(c(VAR[-.N], paste0("Scalar.", 
  sub("^[^.]+\\.", "", VAR[1]), "_Amt")), rep(c(NA, sum(AMOUNT, na.rm = TRUE)),
          c(.N-1, 1))), .(KEY, CAT, DATE)]
df1
#   KEY CAT      DATE AMOUNT                     VAR VALUE
#1:   1  26 2015/10/1   1400      Event.Budget_Cat26    NA
#2:   1  26 2015/10/1    300      Event.Budget_Cat26    NA
#3:   1  26 2015/10/1     NA Scalar.Budget_Cat26_Amt  1700
#4:   1  22 2015/10/1    100      Event.Budget_Cat22    NA
#5:   1  22 2015/10/1    300      Event.Budget_Cat22    NA
#6:   1  22 2015/10/1     NA Scalar.Budget_Cat22_Amt   400
#7:   2  21  2014/1/1    200      Event.Budget_Cat21    NA
#8:   2  21  2014/1/1     NA Scalar.Budget_Cat21_Amt   200

Upvotes: 1

Oriol Mirosa
Oriol Mirosa

Reputation: 2826

Here's a solution with tidyverse:

library(tidyverse)

KEY <- c(1, 1, 1, 1, 1, 1, 2, 2)
CAT <- c(26, 26, 26, 22, 22, 22, 21,21)
DATE <- c('2015/10/1', '2015/10/1', '2015/10/1', '2015/10/1', '2015/10/1', '2015/10/1', '2014/1/1', '2014/1/1')
AMOUNT <- c(1400, 300, NA, 100, 300, NA, 200, NA)
VAR <- c('Event.Budget_Cat26', 'Event.Budget_Cat26', NA, 'Event.Budget_Cat22', 'Event.Budget_Cat22', NA, 'Event.Budget_Cat21', NA)

df <- data_frame(KEY, CAT, DATE, AMOUNT, VAR, VALUE = NA)

summary_rows <- df %>%
  group_by(KEY, CAT, DATE) %>%
  summarise(VAR = paste0('Scalar.Budget_Cat', max(CAT), '_Amt'),
            VALUE = sum(AMOUNT, na.rm = T),
            AMOUNT = NA) %>%
  select(KEY, CAT, DATE, AMOUNT, VAR, VALUE) %>%
  arrange(KEY, -CAT, DATE)

df[is.na(AMOUNT),] <- summary_rows

df

##     KEY   CAT      DATE AMOUNT                     VAR VALUE
##   <dbl> <dbl>     <chr>  <dbl>                   <chr> <dbl>
## 1     1    26 2015/10/1   1400      Event.Budget_Cat26    NA
## 2     1    26 2015/10/1    300      Event.Budget_Cat26    NA
## 3     1    26 2015/10/1     NA Scalar.Budget_Cat26_Amt  1700
## 4     1    22 2015/10/1    100      Event.Budget_Cat22    NA
## 5     1    22 2015/10/1    300      Event.Budget_Cat22    NA
## 6     1    22 2015/10/1     NA Scalar.Budget_Cat22_Amt   400
## 7     2    21  2014/1/1    200      Event.Budget_Cat21    NA
## 8     2    21  2014/1/1     NA Scalar.Budget_Cat21_Amt   200

Upvotes: 0

Related Questions