Joe
Joe

Reputation: 3806

Sum most recent scores in 3 unique areas

I have dataset of areas and scores in those areas.

I want to maintain an aggregated score (agg_score) that is equal to the sum of the most recent scores for A, B, and C.

For instance you will see in my expected_output for row 4 is 7, because the calue of C is now 2 while the most recent values of A and B are still 1 & 4.

All I have been able to do so far is sum the three most recent scores, which results in agg_score values that equal the sum of C, C, and B at times. It is important that I have an accurate agg_score at each possible date.

library(dplyr)

ds <- 
  tibble(
    area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
    score = c(1,4,5,2,6,3,4,6,3),
    scoring_date = 
      seq.Date(
        from = as.Date("2019-01-01"), 
        to = as.Date("2019-01-09"), 
        by = "days"
      ),
    expected_output = c(NA, NA, 10, 7, 9, 11, 12, 12, 13)
  ) %>%
  arrange(scoring_date)

# Inadequate code for summing last three scores
ds %>% 
  mutate(agg_score = score + lag(score) + lag(score, 2))

Upvotes: 4

Views: 90

Answers (5)

chinsoon12
chinsoon12

Reputation: 25225

Another possible data.table approach.

ds[, output := 
        ds[, 
            ds[.(area=unique(area), scd=.BY$scoring_date), 
                sum(score), 
                on=.(area=area, scoring_date<=scd), 
                mult="last"], 
            by=.(area, scoring_date)]$V1
    ]

output:

   area score scoring_date output
1:    A     1   2019-01-01     NA
2:    B     4   2019-01-02     NA
3:    C     5   2019-01-03     10
4:    C     2   2019-01-04      7
5:    B     6   2019-01-05      9
6:    A     3   2019-01-06     11
7:    A     4   2019-01-07     12
8:    B     6   2019-01-08     12
9:    C     3   2019-01-09     13

data:

library(data.table)
ds <- data.table(
    area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
    score = c(1,4,5,2,6,3,4,6,3),
    scoring_date = seq.Date(from = as.Date("2019-01-01"), to = as.Date("2019-01-09"), by = "days"))

Explanation:

The gist of the above code is:

ds[.(area=unique(area), scd=.BY$scoring_date), 
    sum(score), 
    on=.(area=area, scoring_date<=scd), 
    mult="last"]

It means for each date (scd=.BY$scoring_date), we try to perform a non-equi self join to find the latest (mult="last") score for all areas (area=unique(area))

Upvotes: 0

Mike H.
Mike H.

Reputation: 14370

There might be a data.table self-merge option out there, but I couldn't quite figure it out. Here's an idea using implementing your fill but in data.table. Should be flexible for more "area"s:

library(data.table)

lapply(unique(ds$area), function(a){
  ds[, paste0("val_",a) := zoo::na.locf0(ifelse(area==a, score, NA))]
  invisible(return(NULL))
})
ds[, agg_score := rowSums(.SD), .SDcols = paste0("val_", unique(ds$area))][,  paste0("val_", unique(ds$area)) := NULL]

ds
#  area score scoring_date agg_score
#1    A     1   2019-01-01        NA
#2    B     4   2019-01-02        NA
#3    C     5   2019-01-03        10
#4    C     2   2019-01-04         7
#5    B     6   2019-01-05         9
#6    A     3   2019-01-06        11
#7    A     4   2019-01-07        12
#8    B     6   2019-01-08        12
#9    C     3   2019-01-09        13

Original solution:

Alternatively you could try an sapply. The function is a little long, but that's because we have a lot of work to do! If you wanted to do this on more areas you wouldn't have to manually fill each one, so that could be a benefit:

ds$agg_score <- sapply(1:nrow(ds), function(i) {other_areas <- setdiff(unique(ds$area), ds[i, "area"])
                                                f_idxs = Filter(function(x) x < i, which(ds$area %in% other_areas)) #Locate other areas that come before current index
                                                if(length(f_idxs) == 0) return(NA)
                                                idxs   = sapply(split(f_idxs, ds[f_idxs, "area"]), max) #Split based on area so we can get maximum index before our date
                                                if(length(idxs) < length(other_areas)) return(NA)
                                                sum(ds[c(idxs, i), "score"])}) #Sum up our scores

Upvotes: 2

A. Suliman
A. Suliman

Reputation: 13135

Using dplyr::last we can find the last 'recent' value for each area then sum them when length reaches 3.

#small function to clarify
sum_fun<-function(x){
  #browser()
  lc_vec <- ds[1:x,] %>% group_by(area) %>% summarise(lc=last(score)) %>% pull(lc)  
  lc_vecf <- ifelse(length(lc_vec)==3,sum(lc_vec),NA)
  return(lc_vecf)
}

library(dplyr)
ds %>% mutate(Output=sapply(1:nrow(.),sum_fun)) #Instead of sapply we can use purrr::map_dpl

# A tibble: 9 x 5
area  score scoring_date expected_output Output
<chr> <dbl> <date>                 <dbl>  <dbl>
1 A        1. 2019-01-01               NA     NA 
2 B        4. 2019-01-02               NA     NA 
3 C        5. 2019-01-03               10.    10.
4 C        2. 2019-01-04                7.     7.
5 B        6. 2019-01-05                9.     9.
6 A        3. 2019-01-06               11.    11.
7 A        4. 2019-01-07               12.    12.
8 B        6. 2019-01-08               12.    12.
9 C        3. 2019-01-09               13.    13.

Upvotes: 2

Joe
Joe

Reputation: 3806

So I found a way to do this using fill() to ensure the most recent value is always carried forward until replaced by a more recent value.

library(tidyr)
ds %>% 
  select(area, score, scoring_date) %>% 
  spread(area, score) %>% 
  fill(A, .direction = "down") %>% 
  fill(B, .direction = "down") %>% 
  fill(C, .direction = "down") %>% 
  rowwise() %>% 
  mutate(agg_score = sum(A, B, C))

Upvotes: 1

Jorge Lopez
Jorge Lopez

Reputation: 467

nuevoDs<-ds %>% arrange(desc(scoring_date)) %>% as.data.frame
#getting length of dataframe
longitud<-nrow(nuevoDs)
#we will iterate on each value up until (longitud - 2) and save results to a vector
elVector <- vector()

for(i in 1:(longitud-2))
{
  elVector[i] <- nuevoDs[i,"score"] + nuevoDs[i+1,"score"] + nuevoDs[i+2,"score"]

}

#before cbinding we need to make the vector the same length as your dataFrame

elVector[longitud-1] <- 0
elVector[longitud] <- 0

elVector

cbind(nuevoDs,elVector)




 area score scoring_date elVector
1    C     3   2019-01-09       13
2    B     6   2019-01-08       13
3    A     4   2019-01-07       13
4    A     3   2019-01-06       11
5    B     6   2019-01-05       13
6    C     2   2019-01-04       11
7    C     5   2019-01-03       10
8    B     4   2019-01-02        0
9    A     1   2019-01-01        0

Upvotes: 0

Related Questions