Reputation: 1285
Here is my dataframe
data<-data.frame(
ID=c(1:8),
Diag1=c(1,0,1,0,1,0,1,0),
Diag2=c(0,1,0,1,0,0,1,0),
Diag3=c(0,0,0,1,0,1,1,0),
Multiple.Diag=c(0,0,1,1,0,0,1,0)
)
I have patients with different diagnoses, some of them have multiple diagnoses. These diagnoses are dummy variables that need to be converted into categorical variables. If the patient has Mult.diag==1
, his diagnosis will be Multiple.diag
, otherwise his diagnosis will be either Diag1
, Diag2
or Diag3
. If the patient has 0 for the whole variables, the diagnosis will be "Other"
.
Here is what I want to have:
ID Diagnosis
1 1 Diag1
2 2 Diag2
3 3 Multiple.Diag
4 4 Multiple.Diag
5 5 Diag1
6 6 Diag3
7 7 Multiple.Diag
8 8 Other
Upvotes: 0
Views: 826
Reputation: 887951
Here is an option with max.col
to find the index of the column with the highest value in each row of subset of 'Diag' columns, use that index to get the column name, then change those elements to 'Multiple.Diag' where it is 1 and if there are no 1s in a row, we categorize it as 'Other' using a logical expression created with rowSums
-codes
out <- cbind(data['ID'], Diagnosis = names(data)[-1][
max.col(data[startsWith(names(data), 'Diag')], 'first')])
i1 <- as.logical(data$Multiple.Diag)
out$Diagnosis[i1] <- 'Multiple.Diag'
i2 <- rowSums(data[-1]) == 0
out$Diagnosis[i2] <- 'Other'
-output
out
# ID Diagnosis
#1 1 Diag1
#2 2 Diag2
#3 3 Multiple.Diag
#4 4 Multiple.Diag
#5 5 Diag1
#6 6 Diag3
#7 7 Multiple.Diag
#8 8 Other
max.col
is vectorized and should be very fast
Or another option is to use which
with arr.ind = TRUE
m1 <- which(data[-1] == 1, arr.ind = TRUE)
cbind(data['ID'], Diagnosis = ifelse(rowSums(data[-1]) == 0,
'Other', names(data)[-1][tapply(m1[, 2], m1[,1], FUN = max)]))
-output
# ID Diagnosis
#1 1 Diag1
#2 2 Diag2
#3 3 Multiple.Diag
#4 4 Multiple.Diag
#5 5 Diag1
#6 6 Diag3
#7 7 Multiple.Diag
#8 8 Other
Or using tidyverse
using the same method
library(dplyr)
data %>%
transmute(ID, Diagnosis = case_when(rowSums(.[-1]) == 0 ~ "Other",
TRUE ~ names(.)[-1][max.col(.[-1], 'last')]))
-output
# ID Diagnosis
#1 1 Diag1
#2 2 Diag2
#3 3 Multiple.Diag
#4 4 Multiple.Diag
#5 5 Diag1
#6 6 Diag3
#7 7 Multiple.Diag
#8 8 Other
Or another option where we use rowwise
along with c_across
data %>%
rowwise %>%
transmute(ID, Diagnosis = coalesce(case_when(as.logical(Multiple.Diag)
~ 'Multiple.Diag',
TRUE ~ names(.)[-1][as.logical(c_across(-1))][1] ), 'Other'))
-output
# A tibble: 8 x 2
# Rowwise:
# ID Diagnosis
# <int> <chr>
#1 1 Diag1
#2 2 Diag2
#3 3 Multiple.Diag
#4 4 Multiple.Diag
#5 5 Diag1
#6 6 Diag3
#7 7 Multiple.Diag
#8 8 Other
Upvotes: 1
Reputation: 79348
with tidyverse you could also do:
data %>%
pivot_longer(-ID) %>%
group_by(ID) %>%
slice(which.max(as.integer(factor(name))*value))%>%
mutate(name = if_else(value == 0, 'other',name), value= NULL)
# A tibble: 8 x 2
# Groups: ID [8]
ID name
<int> <chr>
1 1 Diag1
2 2 Diag2
3 3 Multiple.Diag
4 4 Multiple.Diag
5 5 Diag1
6 6 Diag3
7 7 Multiple.Diag
8 8 other
Upvotes: 1
Reputation: 39613
You can use apply()
and built-in function like this. You can add the result to your original data
or saving in a new dataframe. The function myfunc
extract the name of the variable for treatment based on the directions you mentioned. Here the code:
#Code
myfunc <- function(x)
{
y <- names(x)[max(which(x==1))]
if(is.na(y))
{
y <- 'Others'
}
return(y)
}
#Add var
data$Var <- apply(data[,-1],1,myfunc)
Output:
ID Diag1 Diag2 Diag3 Multiple.Diag Var
1 1 1 0 0 0 Diag1
2 2 0 1 0 0 Diag2
3 3 1 0 0 1 Multiple.Diag
4 4 0 1 1 1 Multiple.Diag
5 5 1 0 0 0 Diag1
6 6 0 0 1 0 Diag3
7 7 1 1 1 1 Multiple.Diag
8 8 0 0 0 0 Others
Upvotes: 1