problems with data preparation

Question

Sorry for the unspecific title. Here's the data.table of interest:

dt <- data.table(K=c("A","A","A","B","B","B"),Y=c("2010","2010","2011","2011","2011","2010"),Q1=c(2,3,4,1,3,4),Q2=c(3,3,3,1,1,1))
dt
   K    Y Q1 Q2
1: A 2010  2  3
2: A 2010  3  3
3: A 2011  4  3
4: B 2011  1  1
5: B 2011  3  1
6: B 2010  4  1

Let's say the values of K are persons, so we have two here. Quarters of year are stored in Q1 and Q2. Q2 is kind of a reference quarter-variable and the values always relate to year 2011). Now I want to pick those lines in dt, where, for each Person in K, Q1 lies in an interval of 4 quarters before the value of Q2.

An example:
Person A has value 3 in Q2, so values 2 (2011), 1(2011), 4(2010), and 3 (2010) should be picked. Considering this dataset, this would just be line 2. Value Q1=4 in line 3 is too large, value Q1=2 in line 1 is too small. For the second Person "B", only line 6 would be chosen. Not line 4, because this is the same quarter as in Q2 (I want only those smaller than the value in Q2, and line 5 is obviously greater than the value in Q2.

dt_new
   K    Y Q1 Q2
1: A 2010  3  3
2: B 2010  4  1

To sum up:
A value of say 4 in Q2 would mean: Pick all values in Q1 smaller than 4 where Y=2011, and pick all values in Q1 equal or greater than 4 (so just 4), where Y=2010. result: 3(2011),2(2011),1(2011),4(2010). This rule applies for all values of Q2. All this should be done for each Person.

I hope my problem got clear. I think there are many ways to solve this, but since I'm still learning data.table, I wanted to ask you for nice and elegant solutions (hopefully there are any).

Thanks

Edit:
Nearly found a solution: This gives me a logical vector. How can I extract the lines in the dataset?

setkey(dt,K)
dt[,(Q1=Q2 & Y=="2010"),by="K"]
   K    V1
1: A FALSE
2: A  TRUE
3: A FALSE
4: B FALSE
5: B FALSE
6: B  TRUE

without doing this:

log <-dt[,(Q1=Q2 & Y=="2010"),by="K"]$V1
dt[log]

flodel · Accepted Answer

This is a vanilla row-wise filtering so you don't need to (or should not) use grouping (by = "K"), just do:

dt[(Q1 < Q2 & Y == "2011") | (Q1 >= Q2 & Y == "2010"), ]

or maybe something more flexible if you are going to use ranges other than just 4 quarters:

quarter.diff <- function(Q1, Y1, Q2, Y2) {
  4L * (as.integer(Y2) - as.integer(Y1)) +
       (as.integer(Q2) - as.integer(Q1))
}

dt[quarter.diff(Q1, Y, Q2, Y2 = "2011")  > 0L &
   quarter.diff(Q1, Y, Q2, Y2 = "2011") <= 4L, ]

This is not just more general, it reads much better and makes the reference-year-is-2011 assumption explicit.

Notice how I was careful to convert all your columns into integers inside the quarter.diff function. Ideally, your year and quarter data would already be stored as integers rather than character or numeric.

Finally, if you are concerned that quarter.diff is called twice and speed is a concern, you can temporarily store the result as @Arun suggested in the comments:

dt[{qdiff <- quarter.diff(Q1, Y, Q2, Y2 = "2011")
    qdiff > 0L & qdiff <= 4L}, ]

problems with data preparation

Answers (1)

Related Questions