Reputation: 356
Creating a new column in my data frame of > 900,000 rows to which I want values to be
NA
if any of the values in a set of columns are NA
df$newcol[is.na(df$somecol_1) | is.na(df$somecol_2) | is.na(df$somecol_3)] <- NA
0
if all of the values in a set of columns are 0
df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==0] <- 0
1
if any of the values in a set of columns are 1
while none is NA
. This is the tricky part as it creates a myriad of combinations with my ten columns. The whole data frame has >50 columns, of which I have ten columns of interest for this procedure and here I present only three:df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==1] <- 1
I have a feeling I am overthinking this, there must be a way to make 3 easier? Writing different combinations of columns as shown above would take forever with ten. And a loop would go too slow due to the large dataset.
Dummy data:
df <- NULL
df$somecol_1 <- c(1,0,0,NA,0,1,0,NA,1,1)
df$somecol_2 <- c(NA,1,0,0,0,1,0,NA,0,0)
df$somecol_3 <- c(0,0,0,0,0,0,0,0,0,0)
df <- as.data.frame(df)
Based on the above, I want the new column to be
df$newcol <- c(NA,1,0,NA,0,1,0,NA,1,1)
Upvotes: 1
Views: 1413
Reputation: 4456
df$newcol = ifelse(apply(df,1,sum)>=1,1,0)
Should do the trick. First, apply
will sum every row, then:
Whenever there is NA
values in a row (1st case), any operation will return NA; When there are only 0's (2nd case), the sum is 0 (witch isn't >=1)
and the ifelse third argument makes the new entry 0; And when there is at least one 1, the sum is equal or greater than 1 and the ifelse second argument makes the new entry 1.
Edit: As you want to run theese conditions only in some columns - say its columns 1-7, 9, and 23-24 - you can use the code only in that part of the df:
df$newcol = as.numeric(rowSums(df[,c(1:7,9,23:24)])>=1)
OBS: i used the simplified code that Akrun and Gregor have answered.
If you prefer, there are ways to select columns by name: Extracting specific columns from a data frame
Upvotes: 2
Reputation: 887951
We can use rowSums
nm1 <- grep('somecol', names(df))
df$newcol <- NA^(rowSums(is.na(df[nm1])) > 0) *(rowSums(df[nm1], na.rm = TRUE) > 0)
df$newcol
#[1] NA 1 0 NA 0 1 0 NA 1 1
df <- structure(list(somecol_1 = c(1, 0, 0, NA, 0, 1, 0, NA, 1, 1),
somecol_2 = c(NA, 1, 0, 0, 0, 1, 0, NA, 0, 0), somecol_3 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-10L))
Upvotes: 2
Reputation: 146224
as.numeric(rowSums(df) >= 1)
#[1] NA 1 0 NA 0 1 0 NA 1 1
rowSums
will give NA
if there are any missing values. It will be 0 if all the values are 0, it will be 1 otherwise (assuming your data is all either NA
, 0
, or 1
).
(Using akrun's sample data)
Upvotes: 2