Greenstick
Greenstick

Reputation: 9037

Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables)

I have a number of CSV files with columns such as gender, age, diagnosis, etc.

Currently, they are coded as such:

ID, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd

The goal is to transform this data into this target format:

Sidenote: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female.'

ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1,  1,    0,      1,  0,  0,  0,  1,      1,       0,        0
2,  1,    0,      0,  1,  0,  0,  1,      0,       0,        0
3,  0,    1,      0,  0,  1,  0,  0,      0,       1,        0
4,  0,    1,      0,  0,  0,  1,  0,      0,       1,        1 

I've attempted using reshape2's dcast() function but am getting combinations resulting in extremely sparse matrices. Here's a simplified example with just age and gender:

data.train  <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)

ID, male19, male23, male42, male61, female19, female23, female42, female61
1,  0,      0,      1,      0,      0,        0,        0,        0
2,  1,      0,      0,      0,      0,        0,        0,        0
3,  0,      0,      0,      0,      0,        1,        0,        0
4,  0,      0,      0,      0,      0,        0,        0,        1   

Seeing as this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.

Upvotes: 7

Views: 3504

Answers (5)

Jaehyeon Kim
Jaehyeon Kim

Reputation: 1417

Below is a slightly longer way with dcast() and merge(). As gender and age are not unique by ID, a function is created to turn its length into a dummy variable (dum()). On the other hand, diagnosis is set to be counted uniquely by adjusting the formula.

library(reshape2)
data.raw <- read.table(header = T, sep = ",", text = "
id, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd")

# function to create a dummy variable
dum <- function(x) { if(length(x) > 0) 1 else 0 }

# length of dignosis by id, gender and age
diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]

# length of gender by id
gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)

# length of age by id
age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)

merge(merge(gen, age, by = "id"), diag, by = "id")
#  id   female   male 19 23 42 61   anxiety   asthma   copd   diabetes
#1  1        0      1  0  0  1  0         1        1      0          0
#2  2        0      1  1  0  0  0         0        1      0          0
#3  3        1      0  0  1  0  0         0        0      0          1
#4  4        1      0  0  0  0  1         0        0      1          1

Actually I'm not well aware of your model but your setting may be too much as R handles factors by the formula object. For example, if gender is the response, the following matrix will be generated within R. Therefore, as long as you are not going to fit on your own, it'd suffice to set data types and formula appropriately.

data.raw$age <- as.factor(data.raw$age)
model.matrix(gender ~ ., data = data.raw[,-1])
#(Intercept) age23 age42 age61 diagnosis  asthma diagnosis  copd diagnosis  diabetes
#1           1     0     1     0                 1               0                   0
#2           1     0     1     0                 0               0                   0
#3           1     0     0     0                 1               0                   0
#4           1     1     0     0                 0               0                   1
#5           1     0     0     1                 0               0                   1
#6           1     0     0     1                 0               1                   0

If you need all the levels of each variable you can do this by suppressing the intercept in model.matrix and using a wee trick from all-levels-of-a-factor-in-a-model-matrix-in-r

#  Using Akrun's df1, first change all variables, except ID, to factor
df1[-1] <- lapply(df1[-1], factor)

# Use model.matrix to derive dummy coding
m <- data.frame(model.matrix( ~ 0 + . , data=df1, 
             contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))

# Collapse to give final solution
aggregate(. ~ ID, data=m, max)

Upvotes: 1

akrun
akrun

Reputation: 887223

A base R option would be

 (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
 #     values
 #ID  19 23 42 61 anxiety asthma copd diabetes female male
 # 1  0  0  1  0       1      1    0        0      0    1
 # 2  1  0  0  0       0      1    0        0      0    1
 # 3  0  1  0  0       0      0    0        1      1    0
 # 4  0  0  0  1       0      0    1        1      1    0

If you need the original name as well

 (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
 #   Val
 #ID  age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
 #1      0      0      1      0                 1                1
 #2      1      0      0      0                 0                1
 #3      0      1      0      0                 0                0
 #4      0      0      0      1                 0                0
 #  Val
 #ID  diagnosis_copd diagnosis_diabetes gender_female gender_male
 #1              0                  0             0           1
 #2              0                  0             0           1
 #3              0                  1             1           0
 #4              1                  1             1           0

data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", 
"male", "male", "female", "female", "female"), age = c(42L, 42L, 
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", 
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender", 
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")

Upvotes: 4

Jota
Jota

Reputation: 17611

Using reshape from base R:

d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")


new.dat <- cbind(ID=d["ID"],
    g[,grepl("_", names(g))],
    a[,grepl("_", names(a))],
    d[,grepl("_", names(d))])

# convert factors columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)

new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1

#  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1  1           1             0      1      0      0      0                1
#3  2           1             0      0      1      0      0                1
#4  3           0             1      0      0      1      0                0
#5  4           0             1      0      0      0      1                0
#  diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1                 1                  0              0
#3                 0                  0              0
#4                 0                  1              0
#5                 0                  1              1

Upvotes: 3

Steven Beaupr&#233;
Steven Beaupr&#233;

Reputation: 21621

There is a function in the caret package to "dummify" data.

library(caret)
library(dplyr)
predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)

Upvotes: 5

David Arenburg
David Arenburg

Reputation: 92292

You need a melt/dcast combination here (which called recast) in order to convert all columns into one column and avoid combinations

library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1  1  0  0  1  0       1      1    0        0      0    1
# 2  2  1  0  0  0       0      1    0        0      0    1
# 3  3  0  1  0  0       0      0    0        1      1    0
# 4  4  0  0  0  1       0      0    1        1      1    0

As per your Sidenote, you can add variable here in order to get the names added too

recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd
# 1  1             0           1      0      0      1      0                 1                1              0
# 2  2             0           1      1      0      0      0                 0                1              0
# 3  3             1           0      0      1      0      0                 0                0              0
# 4  4             1           0      0      0      0      1                 0                0              1
#   diagnosis_diabetes
# 1                  0
# 2                  0
# 3                  1
# 4                  1

Upvotes: 8

Related Questions