R: Dealing with time mismatches while using data.table

Question

Hey I'm trying to transition from using ddply to using data.table, and I'm close to figuring it out, but I have slight adjustments that I still need to make. Here's the summary of what I'm trying to do, using a toy data set:

Let's say I have sales data for two products for multiple weeks.

    x <- structure(list(week = c(1, 1, 2, 3, 1, 2, 2, 3, 4), product = c("a", 
    "a", "a", "a", "b", "b", "b", "b", "b"), sold = c(10, 15, 20, 
    25, 30, 35, 40, 45, 50)), .Names = c("week", "product", "sold"
    ), row.names = c(NA, -9L), class = c("data.table", "data.frame"
    ), sorted = c("product", "week"))

        week product sold
     1:    1       a   10
     2:    1       a   15
     3:    2       a   20
     4:    3       a   25
     5:    1       b   30
     6:    2       b   35
     7:    2       b   40
     8:    3       b   45
     9:    4       b   50

I want to find the total sales for product i for week j, i.e. I sold 25 units of product a for week 1.

I used the following code to do this:

    setDT(x)
    setkey(x,product,week)
    > x1 <- x[x,sum(sold)]
    > x1
       product week V1
    1:       a    1 25
    2:       a    1 25
    3:       a    2 20
    4:       a    3 25
    5:       b    1 30
    6:       b    2 75
    7:       b    2 75
    8:       b    3 45
    9:       b    4 50

The problem is I'm not sure how to remove the duplicate rows, ie. row 2 is redundant. Additionally, I also want to include NA for weeks where the product didn't sell, ie. a row for product a week 4 with value NA.

I'm sure this is an easy question and I know how to do this in ddply, but I couldn't find what I was looking for through searching. If anyone could help out or link me to the right page if this is a duplicate, that would be great.

GSee · Accepted Answer

Here's how you can cross join by unique products and all weeks, then sum each group.

> x[CJ(unique(product), 1:4), sum(sold), by=.EACHI]
   product week V1
1:       a    1 25
2:       a    2 20
3:       a    3 25
4:       a    4 NA
5:       b    1 30
6:       b    2 75
7:       b    3 45
8:       b    4 50

If you're using data.table version <= 1.9.2, then just remove the by = .EACHI part. It's a new design change from the next version on (currently implemented in 1.9.3 devel). Check out the NEWS for more info on this.

R: Dealing with time mismatches while using data.table

Answers (2)

Related Questions