Reputation: 9037
I have a number of CSV files with columns such as gender, age, diagnosis, etc.
Currently, they are coded as such:
ID, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd
The goal is to transform this data into this target format:
Sidenote: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female.'
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0
2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0
3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0
4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1
I've attempted using reshape2's dcast()
function but am getting combinations resulting in extremely sparse matrices. Here's a simplified example with just age and gender:
data.train <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)
ID, male19, male23, male42, male61, female19, female23, female42, female61
1, 0, 0, 1, 0, 0, 0, 0, 0
2, 1, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 1, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 1
Seeing as this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.
Upvotes: 7
Views: 3504
Reputation: 1417
Below is a slightly longer way with dcast()
and merge()
. As gender and age are not unique by ID, a function is created to turn its length into a dummy variable (dum()
). On the other hand, diagnosis is set to be counted uniquely by adjusting the formula.
library(reshape2)
data.raw <- read.table(header = T, sep = ",", text = "
id, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd")
# function to create a dummy variable
dum <- function(x) { if(length(x) > 0) 1 else 0 }
# length of dignosis by id, gender and age
diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]
# length of gender by id
gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)
# length of age by id
age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)
merge(merge(gen, age, by = "id"), diag, by = "id")
# id female male 19 23 42 61 anxiety asthma copd diabetes
#1 1 0 1 0 0 1 0 1 1 0 0
#2 2 0 1 1 0 0 0 0 1 0 0
#3 3 1 0 0 1 0 0 0 0 0 1
#4 4 1 0 0 0 0 1 0 0 1 1
Actually I'm not well aware of your model but your setting may be too much as R handles factors by the formula object. For example, if gender is the response, the following matrix will be generated within R. Therefore, as long as you are not going to fit on your own, it'd suffice to set data types and formula appropriately.
data.raw$age <- as.factor(data.raw$age)
model.matrix(gender ~ ., data = data.raw[,-1])
#(Intercept) age23 age42 age61 diagnosis asthma diagnosis copd diagnosis diabetes
#1 1 0 1 0 1 0 0
#2 1 0 1 0 0 0 0
#3 1 0 0 0 1 0 0
#4 1 1 0 0 0 0 1
#5 1 0 0 1 0 0 1
#6 1 0 0 1 0 1 0
If you need all the levels of each variable you can do this by suppressing the intercept in model.matrix
and using a wee trick from all-levels-of-a-factor-in-a-model-matrix-in-r
# Using Akrun's df1, first change all variables, except ID, to factor
df1[-1] <- lapply(df1[-1], factor)
# Use model.matrix to derive dummy coding
m <- data.frame(model.matrix( ~ 0 + . , data=df1,
contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))
# Collapse to give final solution
aggregate(. ~ ID, data=m, max)
Upvotes: 1
Reputation: 887223
A base R
option would be
(!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
# values
#ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1 0 0 1 0 1 1 0 0 0 1
# 2 1 0 0 0 0 1 0 0 0 1
# 3 0 1 0 0 0 0 0 1 1 0
# 4 0 0 0 1 0 0 1 1 1 0
If you need the original name as well
(!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
# Val
#ID age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
#1 0 0 1 0 1 1
#2 1 0 0 0 0 1
#3 0 1 0 0 0 0
#4 0 0 0 1 0 0
# Val
#ID diagnosis_copd diagnosis_diabetes gender_female gender_male
#1 0 0 0 1
#2 0 0 0 1
#3 0 1 1 0
#4 1 1 1 0
df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male",
"male", "male", "female", "female", "female"), age = c(42L, 42L,
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma",
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender",
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")
Upvotes: 4
Reputation: 17611
Using reshape
from base R:
d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")
new.dat <- cbind(ID=d["ID"],
g[,grepl("_", names(g))],
a[,grepl("_", names(a))],
d[,grepl("_", names(d))])
# convert factors columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)
new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1
# ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1 1 1 0 1 0 0 0 1
#3 2 1 0 0 1 0 0 1
#4 3 0 1 0 0 1 0 0
#5 4 0 1 0 0 0 1 0
# diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1 1 0 0
#3 0 0 0
#4 0 1 0
#5 0 1 1
Upvotes: 3
Reputation: 21621
There is a function in the caret
package to "dummify" data.
library(caret)
library(dplyr)
predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)
Upvotes: 5
Reputation: 92292
You need a melt
/dcast
combination here (which called recast
) in order to convert all columns into one column and avoid combinations
library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
# ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1 1 0 0 1 0 1 1 0 0 0 1
# 2 2 1 0 0 0 0 1 0 0 0 1
# 3 3 0 1 0 0 0 0 0 1 1 0
# 4 4 0 0 0 1 0 0 1 1 1 0
As per your Sidenote, you can add variable
here in order to get the names added too
recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
# ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd
# 1 1 0 1 0 0 1 0 1 1 0
# 2 2 0 1 1 0 0 0 0 1 0
# 3 3 1 0 0 1 0 0 0 0 0
# 4 4 1 0 0 0 0 1 0 0 1
# diagnosis_diabetes
# 1 0
# 2 0
# 3 1
# 4 1
Upvotes: 8