Reputation: 87
I am trying to get the dummy variables for the following table:
df:
Value1 var1 var2 var3 var4
9.330154398 HomeATL AwayHOU HomeEast AwayWest
32.43881489 AwaySDN HomeATL HomeWest AwayWest
54.77178387 AwayLAN HomeATL AwayEast HomeSame
54.77178387 AwayLAN HomeATL AwayWest HomeEast
The column var1
and var2
share the same levels. On the other hand, the column var3
and var4
whare their levels as well. Therefore, I need that during the creation of the dummy variables, the new columns created should not have repeated levels. I mean, in the example of var3 and var4, for the 1st and 3rd row, both have AwayWest
, so I only need to fill 1 column named AwayWest
with a number 1 on each row.
My desired output is:
Value1 HomeEast HomeWest AwayEast AwayWest HomeSame HomeATL AwayHOU AwaySDN AwayLAN
9.330154398 1 0 0 1 0 1 1 0 0
32.43881489 0 1 0 1 0 1 0 1 0
54.77178387 0 0 1 0 1 1 0 0 1
54.77178387 1 0 0 1 0 1 0 0 1
I tried creating a new column of 1 (col1
) for each column to be transformed:
spread(df,var1, col1) %>%
spread(var2, col1)%>%
spread(var3, col1)%>%
spread(var1, col1)
However it is not working.
Thanks
Upvotes: 0
Views: 197
Reputation: 2764
You can also do something like this-
> data.table::setDT(df)[,id:=1:.N]
> cbind(df[,.(Value1)],dcast(
melt(setDT(df)[, c(.(id=id), lapply(c("var1","var2","var3","var4"), function(x) paste0(x, get(x))))], id.vars="id"),
id ~ value,
length))
Output-
Value1 id var1AwayLAN var1AwaySDN var1HomeATL var2AwayHOU var2HomeATL var3AwayEast var3AwayWest var3HomeEast var3HomeWest
1: 9.330154 1 0 0 1 1 0 0 0 1 0
2: 32.438815 2 0 1 0 0 1 0 0 0 1
3: 54.771784 3 1 0 0 0 1 1 0 0 0
4: 54.771784 4 1 0 0 0 1 0 1 0 0
Upvotes: 0
Reputation: 887961
A base R
option would be table
tbl <- +(table(c(col(df1[-1])), unlist(df1[-1]) ) > 0)
cbind(df1[1], as.data.frame.matrix(tbl))
Or using tidyverse
library(tidyverse)
rownames_to_column(df1, 'rn') %>%
gather(key, val, var1:var4) %>%
count(rn, val) %>%
spread(val, n, fill = 0) %>%
select(-rn) %>%
bind_cols(df1[1], .)
df1 <- structure(list(Value1 = c(9.330154398, 32.43881489, 54.77178387,
54.77178387), var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"
), var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"), var3 = c("HomeEast",
"HomeWest", "AwayEast", "AwayWest"), var4 = c("AwayWest", "AwayWest",
"HomeSame", "HomeEast")), class = "data.frame", row.names = c(NA,
-4L))
Upvotes: 2
Reputation: 50738
A base R option would be to use model.matrix
df <- cbind(df[, "Value1", drop = F], model.matrix(Value1 ~ . - 1, data = df))
df
# Value1 var1AwayLAN var1AwaySDN var1HomeATL var2HomeATL var3AwayWest
#1 9.330154 0 0 1 0 0
#2 32.438815 0 1 0 1 0
#3 54.771784 1 0 0 1 0
#4 54.771784 1 0 0 1 1
# var3HomeEast var3HomeWest var4HomeEast var4HomeSame
#1 1 0 0 0
#2 0 1 0 0
#3 0 0 0 1
#4 0 0 1 0
If necessary, we can fix the column names with
names(df) <- sub("var\\d", "", names(df))
to reproduce your expected output.
df <- read.table(text =
"Value1 var1 var2 var3 var4
9.330154398 HomeATL AwayHOU HomeEast AwayWest
32.43881489 AwaySDN HomeATL HomeWest AwayWest
54.77178387 AwayLAN HomeATL AwayEast HomeSame
54.77178387 AwayLAN HomeATL AwayWest HomeEast", header = T)
Upvotes: 3