statsguyz
statsguyz

Reputation: 459

Complex table analysis with R

I have the following data:

Group           Subject           Lab               LabValue          Visit         Baseline    Count
1               001               Lab1              10                Day 1         Y           1
1               001               Lab1              11                Day 2                     2
1               001               Lab1              12                Day 30                    3
1               002               Lab1              11                Day 1         Y           1
1               002               Lab1              12                Day 30                    2
2               005               Lab1               9                Day 1         Y           1
2               005               Lab1              16                Day 2                     2
2               005               Lab1              11                Month 1                   3
2               006               Lab1              18                Day 1         Y           1
2               006               Lab1              10                Day 2                     2

I would like to create two tables:

The first table which displays change from baseline values for Lab1, which calculates the mean at visit and recalculates the baseline with the new subset of patients with that particular visit:

   Visit        NumberSubjects     VisitMean     BaselineMean    MeanChangeFromBaseline
   Baseline     4                  12            12              .
   Day 2        3                  12.3          13              -0.67
   Day 30       2                  12            10.5            +1.5
   Month 1      1                  11            9               +2

The second table allows for the option of 'pushing' the last visit all of the subjects without a visit at Month 1 in to the Month 1 row (i.e. subjects 001, 002, and 005):

   Visit        NumberSubjects     VisitMean     BaselineMean    MeanChangeFromBaseline
   Baseline     4                  12            12              .
   Month 1      4                  11.25         12              -0.75

(For subjects 001 and 002, the values on Day 30 become the values for Month 1; for subject 006, the value for Day 2 becomes the value for Month 1).

This is quite complex but I wanted to see if there was a method for doing this using dplyr.

I believe I can select the last value for each subject with the following code:

do.call("rbind", 
        by(df, INDICES=df$Subject, FUN=function(DF) DF[which.max(DF$count), ]))

Any insight as to how i can create these two tables would be appreciated.

Upvotes: 4

Views: 293

Answers (1)

Ben
Ben

Reputation: 30474

Here is how I might approach this.

Edit To address the problem if lack of Baseline value for a given Subject, as suggested by OP in comment, use the first LabValue available (assumes sorted by Visit). In addition, to address the problem of multiple LabValue per Visit for a given Subject, add group_by(Visit, Subject) and slice(1) to use the first value available. The code has been updated, the output should be the same for the original example.

For Table 1, you can group_by(Subject) and then add an additional column of the LabValue at baseline for each subject. This will make it easier to compute statistics by visit.

library(tidyverse)

# Table 1
df %>%
  group_by(Subject) %>%
  mutate(Baseline = first(LabValue)) %>%
  group_by(Visit, Subject) %>% 
  slice(1) %>%
  group_by(Visit) %>%
  summarise(NumberSubjects = n(),
            VisitMean = mean(LabValue),
            BaselineMean = mean(Baseline),
            MeanChangeFromBaseline = VisitMean - BaselineMean)

Output

# A tibble: 4 x 5
  Visit   NumberSubjects VisitMean BaselineMean MeanChangeFromBaseline
  <chr>            <int>     <dbl>        <dbl>                  <dbl>
1 Day_1                4   12           12                         0  
2 Day_2                3   12.3333      12.3333                    0  
3 Day_30               2   12           10.5                       1.5
4 Month_1              1   11            9                         2 

For Table 2, I would recommend using something like complete to add empty rows of data collection visits that were not done for each subject, and then fill to do "last observation carried forward". Then the latest available LabValue will be carried forward into Month 1.

# Table 2
df %>%
  group_by(Subject) %>%
  mutate(Baseline = first(LabValue)) %>%
  ungroup() %>%
  complete(Subject, Visit) %>%
  group_by(Subject) %>%
  fill(LabValue, Baseline, .direction = "down") %>%
  filter(Visit == "Day_1" | Visit == "Month_1") %>%
  group_by(Visit, Subject) %>% 
  slice(1) %>%
  group_by(Visit) %>%
  summarise(NumberSubjects =sum(!is.na(LabValue)),
            VisitMean = mean(LabValue, na.rm = TRUE),
            BaselineMean = mean(Baseline, na.rm = TRUE),
            MeanChangeFromBaseline = VisitMean - BaselineMean)

Output

# A tibble: 2 x 5
  Visit   NumberSubjects VisitMean BaselineMean MeanChangeFromBaseline
  <chr>            <int>     <dbl>        <dbl>                  <dbl>
1 Day_1                4     12              12                   0   
2 Month_1              4     11.25           12                  -0.75

Upvotes: 6

Related Questions