Compute conditionally across rows in data.table in R

Question

I have a data.table with three relevant columns: id, timepoint and metric (actual size is much larger).

I am trying to calculate the percent change between the metric values at timepoints A and D and use it to create a label (Good metric, Half-decent metric, Subpar metric).

It gets more complicated: if the metric at timepoint D is less than or equal to 2, the new column should report "Super metric". Otherwise, the percent change should be calculated and the ids labelled as "Subpar metric" (< 30%), "Half-decent metric" (30-50%), or "Good metric" (> 50%) based on it. If the metric is NA at timepoint A or D, returning NA is okay. If timepoint A or D is missing entirely, also return NA.

My initial thought was that I could calculate this in data.table without creating unnecessary columns, but I haven't even been able to get the simpler solution working, where I do the calculations separately and then join them later.

# Example data

library(data.table)
dat <- data.table(id = c(1,1,1,1,2,2,3,3,3,3,4,4,4,6,6,10,10,10,11,11,12,12,14,14),
                  timepoint = c("A","B","C","D","A","D","A","B","C","D","A","B","C","A","D","A","B","D", "A","D","A","D", "A","D"),
                  metric = c(NA, 3, 3, 4, 4, 2, 3, 3, 2, 1, 4, 3, NA, NA, 4, 1, 5, 2, 5,3, 5,5,6,3))

Partial solution: first identify the "Super metric" ids. I would like this to label every row of a "Super metric" id as such, but right now it returns "Super metric" only for the timepoint D row.

# Inefficient solution
# Step 1: Identify id's that need to be computed

dat1 <- dat[, `:=` (Metric_score = if (metric <= 2 & timepoint == "D")
            Metric_score = "Super metric"
            else Metric_score = "Calc PC"),
            by = 'id,timepoint']


# id timepoint metric Metric_score
# 1:  1         A     NA      Calc PC
# 2:  1         B      3      Calc PC
# 3:  1         C      3      Calc PC
# 4:  1         D      4      Calc PC
# 5:  2         A      4      Calc PC # Should be Super metric
# 6:  2         D      2 Super metric

Performing the calculation: this calculates the percent change for all ids, regardless of whether or not it needs to be calculated.

# Step 2: Calculate percent change between timepoint D and A

dat[ , `:=`(col = (metric[timepoint == "A"] - metric[timepoint == "D"])/metric[timepoint == "A"]*100), by = 'id']

Desired output: class each id as "Super metric" when the final score (timepoint D) is <= 2; otherwise, calculate the percent change, (metric@timeD - metric@timeA) / metric@timeA * 100, and classify based on the result: "Subpar metric" (< 30%), "Half-decent metric" (30-50%), "Good metric" (> 50%).

id	timepoint	metric	metric_class
1	A	NA	NA
1	B	3	NA
1	C	3	NA
1	D	4	NA
2	A	4	Super metric
2	D	2	Super metric
3	A	3	Super metric
3	B	3	Super metric
3	C	2	Super metric
3	D	1	Super metric
4	A	4	NA
4	B	3	NA
4	C	NA	NA
6	A	NA	NA
6	D	4	NA
10	A	1	Super metric
10	B	5	Super metric
10	D	2	Super metric
11	A	5	Half-decent metric
11	D	3	Half-decent metric
12	A	5	Subpar metric
12	D	5	Subpar metric
14	A	6	Good metric
14	D	3	Good metric

koolmees · Accepted Answer

Using fcase should give you the desired result.

A change of exactly 0.5 is both between 0.3 and 0.5 and >= 0.5. fcase takes the first matching case in the list, which is "Good metric" here; if you want that changed, simply change the order.
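As an illustration of that first-match rule, here is a minimal sketch on a made-up vector x (hypothetical values, not taken from the question's data). The names good_first and half_first are only for this example.

```r
library(data.table)

# Hypothetical proportions; 0.5 satisfies both the ">= 0.5" condition
# and the "between 0.3 and 0.5" condition.
x <- c(0.1, 0.3, 0.5, 0.8, NA)

# fcase() returns the value paired with the first condition that is TRUE.
good_first <- fcase(x >= 0.5,             "Good metric",
                    between(x, 0.3, 0.5), "Half-decent metric",
                    x < 0.3,              "Subpar metric")

# Same cases, but with the "between" condition listed first.
half_first <- fcase(between(x, 0.3, 0.5), "Half-decent metric",
                    x >= 0.5,             "Good metric",
                    x < 0.3,              "Subpar metric")

print(data.table(x, good_first, half_first))
```

Only the row with x = 0.5 should differ between the two columns, and the NA row stays NA in both because no condition is TRUE for it.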

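# Reshape to one row per id, with one column per timepoint (A, B, C, D)
metrics <- dcast(dat, id ~ timepoint, value.var = "metric")

# fcase returns the first matching case; ids where no condition is TRUE
# (A or D is NA or missing) get NA
metrics[, metric_class := fcase(D <= 2, "Super metric",
                                abs(D-A)/A < 0.3, "Subpar metric",
                                abs(D-A)/A >= 0.5, "Good metric",
                                between(abs(D-A)/A, 0.3, 0.5), "Half-decent metric")]

# Attach the class to every row of each id
dat <- merge(dat, metrics[, .(id, metric_class)], by = "id")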
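Since the question asks about avoiding extra columns, the same classification could also be sketched as a grouped assignment, without reshaping or merging. This is an alternative under stated assumptions, not the method above: classify is a hypothetical helper introduced here, it assumes at most one A row and one D row per id, and it keeps the abs() and the "0.5 counts as Good metric" choice from the answer.

```r
library(data.table)

# Example data from the question
dat <- data.table(
  id = c(1,1,1,1,2,2,3,3,3,3,4,4,4,6,6,10,10,10,11,11,12,12,14,14),
  timepoint = c("A","B","C","D","A","D","A","B","C","D","A","B","C",
                "A","D","A","B","D","A","D","A","D","A","D"),
  metric = c(NA,3,3,4,4,2,3,3,2,1,4,3,NA,NA,4,1,5,2,5,3,5,5,6,3)
)

# Hypothetical helper: classify one id from its A and D values.
# The [1] turns a missing timepoint (zero-length vector) into NA.
classify <- function(metric, timepoint) {
  a <- metric[timepoint == "A"][1]
  d <- metric[timepoint == "D"][1]
  change <- abs(d - a) / a
  # Conditions that evaluate to NA are not matched, so ids with an NA or
  # missing A or D fall through to fcase's default of NA.
  fcase(d <= 2,        "Super metric",
        change >= 0.5, "Good metric",
        change >= 0.3, "Half-decent metric",
        change < 0.3,  "Subpar metric")
}

# The single label per id is recycled to every row of that id.
dat[, metric_class := classify(metric, timepoint), by = id]
print(dat)
```

On the example data this should reproduce the desired output table from the question, including NA for ids 1, 4 and 6.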
