stats134711
stats134711

Reputation: 636

Dealing with NA and boolean sums in R

I'm trying to create a new variable that involves the sum of products of a numeric value and a logical (indicator), but the final result does not make sense.

DATA

I've generated the following dataset as a minimum reproducible example for the issue I'm having. Each row is an individual and the columns mJSW_BL, mJSW_12, mJSW_24, and mJSW_36 are the measurements at baseline, 12, 24 and 36. The last variable I'm creating, JSNCASE_TP indicates the first time (12, 24 or 36) for which an individual meets the definition of of case (decrease from baseline by 0.7). The calculation of JSNCASE_TP should ignore NA values and can take on the values 0,12,24, or 36.

require(dplyr)

set.seed(1)
N = 10
mJSW_BL <- runif(N,0.1,2)
mJSW_12 <- runif(N,0.1,2)
mJSW_24 <- runif(N,0.1,2)
mJSW_36 <- runif(N,0.1,2)

#Randomly set some values to NA
mJSW_12[sample(N,2)] <- NA
mJSW_36[sample(N,1)] <- NA

#Create dataframe
df <- data.frame(mJSW_BL,mJSW_12,mJSW_24,mJSW_36)

df2 <- df %>%
       #Create variables indicating decrease from BL
       mutate(mJSW_BLto12 = mJSW_BL - mJSW_12,
              mJSW_BLto24 = mJSW_BL - mJSW_24,
              mJSW_BLto36 = mJSW_BL - mJSW_36) %>%
       #JSN case - decrease by 0.7 from BL
       mutate(JSNCASE_12 = (mJSW_BLto12>=0.7),
              JSNCASE_24 = (mJSW_BLto24>=0.7),
              JSNCASE_36 = (mJSW_BLto36>=0.7)) %>%
       #Which timepoint did JSN first occur?
       mutate(JSNCASE_TP = sum(12*JSNCASE_12, 
                               24*(JSNCASE_24 & !JSNCASE_12),
                               36*(JSNCASE_36 & !(JSNCASE_12 | JSNCASE_24)),
                               na.rm=TRUE))    

ISSUES

In data df2, take for example, row 4, where JSNCASE_12, JSNCASE_24, and JSNCASE_36 are all TRUE, but JSNCASE_TP=36 . It should be JSNCASE_TP=12. Additionally, take row 6, where JSNCASE_12=NA, JSNCASE_24=TRUE, and JSNCASE_36=FALSE. I should get JSNCASE_TP=24. Maybe I'm missing something basic, but I've tried several ways and haven't produced the desired result. The values of JSNCASE_TP for the 10 rows should be 0,0,0,12,0,24,24,0,0,0.

EDIT Thanks to @Dave2e's comments, the code below works:

df2 <- df %>%
   #Create variables indicating decrease from BL
   mutate(mJSW_BLto12 = mJSW_BL - mJSW_12,
          mJSW_BLto24 = mJSW_BL - mJSW_24,
          mJSW_BLto36 = mJSW_BL - mJSW_36) %>%
   #JSN case - decrease by 0.7 from BL
   mutate(JSNCASE_12 = (mJSW_BLto12>=0.7),
          JSNCASE_24 = (mJSW_BLto24>=0.7),
          JSNCASE_36 = (mJSW_BLto36>=0.7)) %>%
   rowwise() %>%
   #Which timepoint did JSN first occur?
   mutate(JSNCASE_TP = sum(12*JSNCASE_12, 
                           24*(JSNCASE_24 & (!JSNCASE_12| is.na(JSNCASE_12))),
                           36*(JSNCASE_36 & ((!JSNCASE_12 | is.na(JSNCASE_12)) & 
                                             (!JSNCASE_24 | is.na(JSNCASE_24)))),
                           na.rm=TRUE))    

Upvotes: 1

Views: 371

Answers (2)

Dave2e
Dave2e

Reputation: 24089

Having the the NA mixed with the TRUE/FALSE does complicates things.

Here is a hack using the apply function. Basically finds the first column with TRUE in it then multiples by 12 to get the proper time. Since it is possible all columns are FALSE, it needs to check and handle cases where inf values are returned by the min function.

df2 <- df %>%
  #Create variables indicating decrease from BL
  mutate(mJSW_BLto12 = mJSW_BL - mJSW_12,
         mJSW_BLto24 = mJSW_BL - mJSW_24,
         mJSW_BLto36 = mJSW_BL - mJSW_36) %>%
  #JSN case - decrease by 0.7 from BL
  mutate(JSNCASE_12 = (mJSW_BLto12>=0.7),
         JSNCASE_24 = (mJSW_BLto24>=0.7),
         JSNCASE_36 = (mJSW_BLto36>=0.7))


df2$JSNCASE_TP<-12*apply(df2[,8:10], 1, function(x){ ifelse(is.infinite(min(which(x==TRUE))), 0, min(which(x==TRUE)) )})

I'm sure there is a possible dplyr version of this.

Upvotes: 1

A. Suliman
A. Suliman

Reputation: 13135

We can use mutate_at and apply to get the column name if it exists otherwise 0.

library(dplyr)
df %>% 
  mutate_at(vars(matches('.*_\\d+')), list(ind=~mJSW_BL-.>=0.7)) %>% 
  mutate(JSNCASE_TP = apply(.[grepl('.*_ind',names(.))], 1, function(x){
                             x <- x[!is.na(x)]
                             ifelse(all(!x), 0, names(x)[which.max(x)])
}))

     mJSW_BL   mJSW_12   mJSW_24   mJSW_36 mJSW_12_ind mJSW_24_ind mJSW_36_ind  JSNCASE_TP
1  0.6044665 0.4913517 1.8759399 1.0159522       FALSE       FALSE       FALSE           0
2  0.8070354 0.4354578 0.5030708 1.2391751       FALSE       FALSE       FALSE           0
3  1.1884214 1.4053434 1.3381802 1.0377285       FALSE       FALSE       FALSE           0
4  1.8255948 0.8297971 0.3385547 0.4538134        TRUE        TRUE        TRUE mJSW_12_ind
5  0.4831957 1.5626987 0.6077193 1.6720093       FALSE       FALSE       FALSE           0
6  1.8069404        NA 0.8336168 1.3700868          NA        TRUE       FALSE mJSW_24_ind
7  1.8948830 1.4634752 0.1254416 1.6090557       FALSE        TRUE       FALSE mJSW_24_ind
8  1.3555158 1.9846216 0.8265371        NA       FALSE       FALSE          NA           0
9  1.2953167        NA 1.7524126 1.4750508          NA       FALSE       FALSE           0
10 0.2173939 1.5771459 0.7466631 0.8814214       FALSE       FALSE       FALSE           0

However, it's better to transfer your data into tidy data, do the analysis then transfer back into wide format. Here is one way

library(dplyr)
library(tidyr)
df %>% rownames_to_column(var = 'id') %>% 
  gather(key,value,-id) %>% 
  group_by(id) %>% 
  mutate(Ind = ifelse(is.na(value), FALSE, (first(value)-value)>=0.7), JSNCASE_TP = ifelse(any(Ind),key[Ind],'0')) %>% 
  select(-Ind) %>% ungroup() %>% 
  spread(key, value) %>% 
  arrange(as.numeric(id))

Upvotes: 1

Related Questions