Amer
Amer

Reputation: 2251

Apply calculations based on certain conditions in a data frame

I have the following data frame:

Source  mean1   SD  median  range_min   range_max   IQR_25  IQR_75  
1       1.5     0.2 3       NA          NA          NA      NA      
2       NA      NA  2       1           5           1.5     4       
3       NA      NA  4       2           7           NA      NA      

I want to add mean2 and FLAG columns into the data frame based on certain conditions. The conditions are the following:

1) If mean1 and sd are present then copy the same value into the mean2 column and give it FLAG ==1

2) If mean1 and sd are NA but median:IQR_75 are present then put value equal 20 for mean2 and give it a FLAG==2. Note: the value 20 is arbitrary. I will be putting an equation instead but this is to make it easy for the question here.

3) If if mean1 and sd are NA but median, RANGE_min, RANGE_max are present then then put value equal 30 for mean2 and give it a FLAG==3.

The result should look like the following:

Source  mean1   SD  median  range_min   range_max   IQR_25  IQR_75  mean2   FLAG
1       1.5     0.2 3       NA          NA          NA      NA      1.5     1
2       NA      NA  2       1           5           1.5     4       20      2
3       NA      NA  4       2           7           NA      NA      30      3

I have tried the following but didn't have much luck in using it:

df$mean2 <- with(df, (is.na(mean1)==F&(is.na(SD)==F))*mean1+ (is.na(mean1)==T & is.na(SD)==T & is.na(median)==F &
                        is.na(range_min)==F & is.na(range_max)==F & is.na(IQR_25)==F &
                        is.na(IQR_75)==F)*20)

I would ask your help in acheving my goal so I can apply it on my big data set.

Upvotes: 1

Views: 67

Answers (3)

Ricardo Oliveros-Ramos
Ricardo Oliveros-Ramos

Reputation: 4349

Try this:

df$mean2 = NA
df$FLAG = NA

ind1 = complete.cases(df[, c("mean1", "SD")])
ind2 = complete.cases(df[, c("median", "range_min", "range_max", "IQR_25", "IQR_75")])
ind3 = complete.cases(df[, c("median", "range_min", "range_max")])

df$mean2[ind1] = df$mean1[ind1]
df$mean2[!ind1 & ind2] = 20 
df$mean2[!ind1 & !ind2 & ind3] = 30

df$FLAG[ind1] = 1
df$FLAG[!ind1 & ind2] = 2 
df$FLAG[!ind1 & !ind2 & ind3] = 3

Upvotes: 2

Jaap
Jaap

Reputation: 83275

With the data.table package you could do it as follows:

library(data.table)
setDT(df)[!is.na(mean1) & !is.na(SD), `:=` (mean2 = mean1, Flag = 1)
          ][is.na(mean1) & is.na(SD) & complete.cases(median,range_min,range_max,IQR_25,IQR_75),
            `:=` (mean2 = 20, Flag = 2)
            ][is.na(mean1) & is.na(SD) & complete.cases(median,range_min,range_max) & is.na(IQR_25) & is.na(IQR_75),
              `:=` (mean2 = 30, Flag = 3)]

this gives:

> df
   Source mean1  SD median range_min range_max IQR_25 IQR_75 mean2 Flag
1:      1   1.5 0.2      3        NA        NA     NA     NA   1.5    1
2:      2    NA  NA      2         1         5    1.5      4  20.0    2
3:      3    NA  NA      4         2         7     NA     NA  30.0    3

Alternatively, you can create indexes for your several conditions beforehand. This will give much cleaner data.table syntax:

indx1 <- complete.cases(df[c("mean1", "SD")])
indx2 <- complete.cases(df[c("median","range_min","range_max","IQR_25","IQR_75")])
indx3 <- !complete.cases(df[c("IQR_25","IQR_75")]) & complete.cases(df[c("median","range_min","range_max")])

library(data.table)
setDT(df)[indx1, `:=` (mean2 = mean1, Flag = 1)
          ][!indx1 & indx2, `:=` (mean2 = 20, Flag = 2)
            ][!indx1 & indx3, `:=` (mean2 = 30, Flag = 3)]

Upvotes: 4

akrun
akrun

Reputation: 887851

We can create a couple of logical indexes based on the NA elements in the specified columns. The 'indx' gives TRUE if both 'mean1' and 'SD' are not NA, 'indx2' will be TRUE if there are no NA values in a row for 'median:IQR_75' column as we are taking the rowSums of the NA elements. Similarly, 'indx3' gives TRUE for non-NA elements for columns 'median:range_max'.

 indx <- rowSums(!is.na(df1[c('mean1', 'SD')]))==2
 indx2 <- !rowSums(is.na(df1[4:ncol(df1)]))
 indx3 <- !rowSums(is.na(df1[4:6]))

Now we can create a numeric index by arithmetic operations to create a unique index ('indx4') which can be used to populate the values 1.5, 30, 20 or 1:3.

 indx4 <- as.numeric(factor(1+2*indx+4*indx2+8*indx3))
  c(1.5, 30, 20)[indx4]
 #[1]  1.5 20.0 30.0
 c(1,3,2)[indx4]
 #[1] 1 2 3

Or we use a nested ifelse

 df1$mean2 <- ifelse(indx, 1.5, ifelse(indx2, 20, ifelse(indx3, 30, NA)))
 df1$mean2
 #[1]  1.5 20.0 30.0

 df1$FLAG <- ifelse(indx, 1, ifelse(indx2, 2, ifelse(indx3, 3, NA)))
 df1$FLAG
 # [1] 1 2 3

df1
#   Source mean1  SD median range_min range_max IQR_25 IQR_75 mean2 FLAG
#1      1   1.5 0.2      3        NA        NA     NA     NA   1.5    1
#2      2    NA  NA      2         1         5    1.5      4  20.0    2
#3      3    NA  NA      4         2         7     NA     NA  30.0    3

data

df1 <- structure(list(Source = 1:3, mean1 = c(1.5, NA, NA), SD = c(0.2, 
 NA, NA), median = c(3L, 2L, 4L), range_min = c(NA, 1L, 2L),
 range_max = c(NA, 
 5L, 7L), IQR_25 = c(NA, 1.5, NA), IQR_75 = c(NA, 4L, NA)),
 .Names = c("Source", 
 "mean1", "SD", "median", "range_min", "range_max", "IQR_25", 
 "IQR_75"), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 2

Related Questions