Reputation: 459
I have the following data:
Group Subject Lab LabValue Visit Baseline Count
1 001 Lab1 10 Day 1 Y 1
1 001 Lab1 11 Day 2 2
1 001 Lab1 12 Day 30 3
1 002 Lab1 11 Day 1 Y 1
1 002 Lab1 12 Day 30 2
2 005 Lab1 9 Day 1 Y 1
2 005 Lab1 16 Day 2 2
2 005 Lab1 11 Month 1 3
2 006 Lab1 18 Day 1 Y 1
2 006 Lab1 10 Day 2 2
I would like to create two tables:
The first table which displays change from baseline values for Lab1, which calculates the mean at visit and recalculates the baseline with the new subset of patients with that particular visit:
Visit NumberSubjects VisitMean BaselineMean MeanChangeFromBaseline
Baseline 4 12 12 .
Day 2 3 12.3 13 -0.67
Day 30 2 12 10.5 +1.5
Month 1 1 11 9 +2
The second table allows for the option of 'pushing' the last visit all of the subjects without a visit at Month 1 in to the Month 1 row (i.e. subjects 001, 002, and 005):
Visit NumberSubjects VisitMean BaselineMean MeanChangeFromBaseline
Baseline 4 12 12 .
Month 1 4 11.25 12 -0.75
(For subjects 001 and 002, the values on Day 30 become the values for Month 1; for subject 006, the value for Day 2 becomes the value for Month 1).
This is quite complex but I wanted to see if there was a method for doing this using dplyr.
I believe I can select the last value for each subject with the following code:
do.call("rbind",
by(df, INDICES=df$Subject, FUN=function(DF) DF[which.max(DF$count), ]))
Any insight as to how i can create these two tables would be appreciated.
Upvotes: 4
Views: 293
Reputation: 30474
Here is how I might approach this.
Edit To address the problem if lack of Baseline
value for a given Subject
, as suggested by OP in comment, use the first LabValue
available (assumes sorted by Visit
). In addition, to address the problem of multiple LabValue
per Visit
for a given Subject
, add group_by(Visit, Subject)
and slice(1)
to use the first value available. The code has been updated, the output should be the same for the original example.
For Table 1, you can group_by(Subject)
and then add an additional column of the LabValue
at baseline for each subject. This will make it easier to compute statistics by visit.
library(tidyverse)
# Table 1
df %>%
group_by(Subject) %>%
mutate(Baseline = first(LabValue)) %>%
group_by(Visit, Subject) %>%
slice(1) %>%
group_by(Visit) %>%
summarise(NumberSubjects = n(),
VisitMean = mean(LabValue),
BaselineMean = mean(Baseline),
MeanChangeFromBaseline = VisitMean - BaselineMean)
Output
# A tibble: 4 x 5
Visit NumberSubjects VisitMean BaselineMean MeanChangeFromBaseline
<chr> <int> <dbl> <dbl> <dbl>
1 Day_1 4 12 12 0
2 Day_2 3 12.3333 12.3333 0
3 Day_30 2 12 10.5 1.5
4 Month_1 1 11 9 2
For Table 2, I would recommend using something like complete
to add empty rows of data collection visits that were not done for each subject, and then fill
to do "last observation carried forward". Then the latest available LabValue
will be carried forward into Month 1.
# Table 2
df %>%
group_by(Subject) %>%
mutate(Baseline = first(LabValue)) %>%
ungroup() %>%
complete(Subject, Visit) %>%
group_by(Subject) %>%
fill(LabValue, Baseline, .direction = "down") %>%
filter(Visit == "Day_1" | Visit == "Month_1") %>%
group_by(Visit, Subject) %>%
slice(1) %>%
group_by(Visit) %>%
summarise(NumberSubjects =sum(!is.na(LabValue)),
VisitMean = mean(LabValue, na.rm = TRUE),
BaselineMean = mean(Baseline, na.rm = TRUE),
MeanChangeFromBaseline = VisitMean - BaselineMean)
Output
# A tibble: 2 x 5
Visit NumberSubjects VisitMean BaselineMean MeanChangeFromBaseline
<chr> <int> <dbl> <dbl> <dbl>
1 Day_1 4 12 12 0
2 Month_1 4 11.25 12 -0.75
Upvotes: 6