Reputation: 2251
I have the following data frame:
Source mean1 SD median range_min range_max IQR_25 IQR_75
1 1.5 0.2 3 NA NA NA NA
2 NA NA 2 1 5 1.5 4
3 NA NA 4 2 7 NA NA
I want to add mean2
and FLAG
columns into the data frame based on certain conditions. The conditions are the following:
1) If mean1
and sd
are present then copy the same value into the mean2
column and give it FLAG ==1
2) If mean1
and sd
are NA
but median:IQR_75
are present then put value equal 20 for mean2
and give it a FLAG==2
. Note: the value 20 is arbitrary. I will be putting an equation instead but this is to make it easy for the question here.
3) If if mean1
and sd
are NA
but median, RANGE_min, RANGE_max
are present then then put value equal 30 for mean2
and give it a FLAG==3
.
The result should look like the following:
Source mean1 SD median range_min range_max IQR_25 IQR_75 mean2 FLAG
1 1.5 0.2 3 NA NA NA NA 1.5 1
2 NA NA 2 1 5 1.5 4 20 2
3 NA NA 4 2 7 NA NA 30 3
I have tried the following but didn't have much luck in using it:
df$mean2 <- with(df, (is.na(mean1)==F&(is.na(SD)==F))*mean1+ (is.na(mean1)==T & is.na(SD)==T & is.na(median)==F &
is.na(range_min)==F & is.na(range_max)==F & is.na(IQR_25)==F &
is.na(IQR_75)==F)*20)
I would ask your help in acheving my goal so I can apply it on my big data set.
Upvotes: 1
Views: 67
Reputation: 4349
Try this:
df$mean2 = NA
df$FLAG = NA
ind1 = complete.cases(df[, c("mean1", "SD")])
ind2 = complete.cases(df[, c("median", "range_min", "range_max", "IQR_25", "IQR_75")])
ind3 = complete.cases(df[, c("median", "range_min", "range_max")])
df$mean2[ind1] = df$mean1[ind1]
df$mean2[!ind1 & ind2] = 20
df$mean2[!ind1 & !ind2 & ind3] = 30
df$FLAG[ind1] = 1
df$FLAG[!ind1 & ind2] = 2
df$FLAG[!ind1 & !ind2 & ind3] = 3
Upvotes: 2
Reputation: 83275
With the data.table
package you could do it as follows:
library(data.table)
setDT(df)[!is.na(mean1) & !is.na(SD), `:=` (mean2 = mean1, Flag = 1)
][is.na(mean1) & is.na(SD) & complete.cases(median,range_min,range_max,IQR_25,IQR_75),
`:=` (mean2 = 20, Flag = 2)
][is.na(mean1) & is.na(SD) & complete.cases(median,range_min,range_max) & is.na(IQR_25) & is.na(IQR_75),
`:=` (mean2 = 30, Flag = 3)]
this gives:
> df
Source mean1 SD median range_min range_max IQR_25 IQR_75 mean2 Flag
1: 1 1.5 0.2 3 NA NA NA NA 1.5 1
2: 2 NA NA 2 1 5 1.5 4 20.0 2
3: 3 NA NA 4 2 7 NA NA 30.0 3
Alternatively, you can create indexes for your several conditions beforehand. This will give much cleaner data.table
syntax:
indx1 <- complete.cases(df[c("mean1", "SD")])
indx2 <- complete.cases(df[c("median","range_min","range_max","IQR_25","IQR_75")])
indx3 <- !complete.cases(df[c("IQR_25","IQR_75")]) & complete.cases(df[c("median","range_min","range_max")])
library(data.table)
setDT(df)[indx1, `:=` (mean2 = mean1, Flag = 1)
][!indx1 & indx2, `:=` (mean2 = 20, Flag = 2)
][!indx1 & indx3, `:=` (mean2 = 30, Flag = 3)]
Upvotes: 4
Reputation: 887851
We can create a couple of logical indexes based on the NA elements in the specified columns. The 'indx' gives TRUE if both 'mean1' and 'SD' are not NA, 'indx2' will be TRUE if there are no NA values in a row for 'median:IQR_75' column as we are taking the rowSums
of the NA elements. Similarly, 'indx3' gives TRUE for non-NA elements for columns 'median:range_max'.
indx <- rowSums(!is.na(df1[c('mean1', 'SD')]))==2
indx2 <- !rowSums(is.na(df1[4:ncol(df1)]))
indx3 <- !rowSums(is.na(df1[4:6]))
Now we can create a numeric index by arithmetic operations to create a unique index ('indx4') which can be used to populate the values 1.5, 30, 20 or 1:3.
indx4 <- as.numeric(factor(1+2*indx+4*indx2+8*indx3))
c(1.5, 30, 20)[indx4]
#[1] 1.5 20.0 30.0
c(1,3,2)[indx4]
#[1] 1 2 3
Or we use a nested ifelse
df1$mean2 <- ifelse(indx, 1.5, ifelse(indx2, 20, ifelse(indx3, 30, NA)))
df1$mean2
#[1] 1.5 20.0 30.0
df1$FLAG <- ifelse(indx, 1, ifelse(indx2, 2, ifelse(indx3, 3, NA)))
df1$FLAG
# [1] 1 2 3
df1
# Source mean1 SD median range_min range_max IQR_25 IQR_75 mean2 FLAG
#1 1 1.5 0.2 3 NA NA NA NA 1.5 1
#2 2 NA NA 2 1 5 1.5 4 20.0 2
#3 3 NA NA 4 2 7 NA NA 30.0 3
df1 <- structure(list(Source = 1:3, mean1 = c(1.5, NA, NA), SD = c(0.2,
NA, NA), median = c(3L, 2L, 4L), range_min = c(NA, 1L, 2L),
range_max = c(NA,
5L, 7L), IQR_25 = c(NA, 1.5, NA), IQR_75 = c(NA, 4L, NA)),
.Names = c("Source",
"mean1", "SD", "median", "range_min", "range_max", "IQR_25",
"IQR_75"), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 2