Reputation: 714
Updated: With apologies to those who replied, in my original example I overlooked the fact that `data.frame()` created `var` as a factor rather than as a character vector, as I had intended. I have corrected the example, and this will break at least one of the answers.
--original--
I have a data frame that I'm performing a series of dplyr and tidyr manipulations on, and I would like to add columns for indicator variables encoded as 0 or 1, and do this within the dplyr chain. Each level of a factor (presently stored as a character vector) should be encoded in a separate column, and each column name is a concatenation of a fixed prefix with the variable level, e.g. in rows where `var` has level a, the new column `var_a` will be 1, and all other rows of `var_a` will be 0.
The following minimal example using base R produces exactly the results that I want (thanks to this blog post), but I'd like to roll it all into the dplyr chain, and can't quite figure out how to do it.
library(dplyr)
df <- data.frame(var = sample(x = letters[1:4], size = 10, replace = TRUE), stringsAsFactors = FALSE)
for (level in unique(df$var)) {
  df[paste("var", level, sep = "_")] <- ifelse(df$var == level, 1, 0)
}
Note that the real data set contains multiple columns, none of which should be altered or dropped when creating the indicator variables, with the exception of the column `var`, which could be converted to type factor.
Upvotes: 8
Views: 5205
Reputation: 3073
I landed on this Q&A first because I really wanted to put `model.matrix` in a magrittr pipe workflow or produce the equivalent output with just tidyverse functions (sorry, baseRs).
Later, I landed on this solution, which uses those functions in the elegant way I thought was possible (but wasn't coming up with on my own):
library(dplyr)
library(tidyr)
df <- data_frame(var = sample(x = letters[1:4], size = 10, replace = TRUE))
df %>%
  mutate(unique_row_id = 1:n()) %>% # The rows need to be unique for `spread` to work.
  mutate(dummy = 1) %>%
  spread(var, dummy, fill = 0)
So, I'm adding an updated/modified version of the linked solution so that people who land here first don't have to keep looking (like I did).
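For readers on a newer tidyr, a rough equivalent can be sketched with `pivot_wider()`. This is only a sketch, assuming tidyr 1.1.0 or later (for the scalar `values_fill`); the helper columns `row_id` and `level` and the seed are names introduced here for illustration. Unlike the `spread` version, it keeps `var` and prefixes the new columns with `var_`, as the question asks.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumed seed, only to make the example reproducible
df <- tibble(var = sample(letters[1:4], size = 10, replace = TRUE))

res <- df %>%
  mutate(row_id = row_number(),  # unique id so each row stays separate
         level  = var,           # copy of var, so var itself survives the reshape
         dummy  = 1) %>%
  pivot_wider(names_from = level, values_from = dummy,
              values_fill = 0, names_prefix = "var_") %>%
  select(-row_id)                # drop the helper id again
```

The copy of `var` is the design choice that matters here: `pivot_wider()` consumes the column named in `names_from`, so reshaping on a copy leaves the original column in place.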
Upvotes: 2
Reputation: 57696
The only requirements for a function to be part of a dplyr pipeline are that it takes a data frame as input, and returns a data frame as output. So, leveraging `model.matrix`:
make_inds <- function(df, cols = names(df))
{
  # do each variable separately to get around model.matrix dropping aliased columns
  # wrap df in list() so cbind dispatches on the data frame and returns a data frame
  do.call(cbind, c(list(df), lapply(cols, function(n) {
    x <- df[[n]]
    mm <- model.matrix(~ x - 1)
    colnames(mm) <- gsub("^x", paste(n, "_", sep = ""), colnames(mm))
    mm
  })))
}
# insert into pipeline
data %>% ... %>% make_inds %>% ...
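To make the "data frame in, data frame out" requirement concrete, here is a minimal sketch of a function used in the middle of a chain. The function `add_inds`, its single-column signature, and the toy data are hypothetical, introduced only for this illustration; it is a simplified variant, not the function above.

```r
library(dplyr)

# Hypothetical single-column variant, for illustration only
add_inds <- function(df, col) {
  f <- factor(df[[col]])
  mm <- model.matrix(~ f - 1)                       # one 0/1 column per level
  colnames(mm) <- paste(col, levels(f), sep = "_")  # e.g. var_a, var_b, ...
  cbind(df, as.data.frame(mm))                      # data frame in, data frame out
}

df <- data.frame(id = 1:6,
                 var = c("a", "b", "a", "c", "b", "a"),
                 stringsAsFactors = FALSE)

out <- df %>%
  filter(id > 1) %>%   # any upstream step
  add_inds("var")      # sees the filtered data, not the original df
```

Because the data frame arrives as the first argument, the indicators are built from whatever the preceding steps produced.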
Upvotes: 3
Reputation: 43354
It's possible without creating a function, although it does require `lapply`. If `var` is a factor, you can work with its levels; we can bind its columns to an `lapply` which loops over the levels of `var` and creates the values, names them with `setNames`, and converts them into a `tbl_df`.
df %>% bind_cols(as_data_frame(setNames(lapply(levels(df$var),
                                               function(x){as.integer(df$var == x)}),
                                        paste0('var2_', levels(df$var)))))
returns
Source: local data frame [10 x 5]

      var var_d var_c var2_c var2_d
   (fctr) (dbl) (dbl)  (int)  (int)
1       d     1     0      0      1
2       c     0     1      1      0
3       c     0     1      1      0
4       c     0     1      1      0
5       d     1     0      0      1
6       d     1     0      0      1
7       c     0     1      1      0
8       c     0     1      1      0
9       d     1     0      0      1
10      c     0     1      1      0
If `var` is a character vector, not a factor, you can do the same thing, but using `unique` instead of `levels`:
df %>% bind_cols(as_data_frame(setNames(lapply(unique(df$var),
                                               function(x){as.integer(df$var == x)}),
                                        paste0('var2_', unique(df$var)))))
Two notes:
1. `var` should arguably be a `factor` anyway, as it contains a lot of repeated levels.
2. Both versions reference `df$var` as it lives in the calling environment, not as it may exist in a larger chain, and assume `var` is unchanged in whatever it is passed. Referencing the dynamic value of `var` outside of `dplyr`'s normal NSE is rather a pain, insofar as I've seen.

One more alternative that's a little simpler and `factor`-agnostic, using `reshape2::dcast`:
library(reshape2)
df %>% cbind(1 * !is.na(dcast(df, seq_along(var) ~ var, value.var = 'var')[,-1]))
It still pulls the version of `df` from the calling environment, so the chain really only determines what you're joining to. Because it uses `cbind` instead of `bind_cols`, the result will be a `data.frame`, too, not a `tbl_df`, so if you want to keep it all `tbl_df` (smart if the data is big), you'll need to replace the `cbind` with `bind_cols(as_data_frame( ... ))`; `bind_cols` doesn't seem to want to do the conversion for you.
Note, however, that while this version is simpler, it is comparatively slower, both on `factor` data:
Unit: microseconds
   expr      min        lq      mean    median       uq      max neval
 factor  358.889  384.0010  479.5746  427.9685  501.580 3995.951   100
 unique  547.249  585.4205  696.4709  633.4215  696.402 4528.099   100
  dcast 2265.517 2490.5955 2721.1118 2628.0730 2824.949 3928.796   100
and string data:
Unit: microseconds
   expr      min       lq      mean    median        uq      max neval
 unique  307.190  336.422  414.1031  362.6485  419.3625 3693.340   100
  dcast 2117.807 2249.077 2517.0417 2402.4285 2615.7290 3793.178   100
For small data it won't matter, but for bigger data, it may be worth putting up with the complication.
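The calling-environment caveat noted above can be sidestepped by giving up the "no function" constraint: a small wrapper receives the piped data as its first argument and builds the indicators from that. The following is only a sketch; `add_var_inds` and its `prefix` argument are hypothetical names introduced here, and it assumes dplyr is attached (which also provides `as_tibble`).

```r
library(dplyr)

# Hypothetical helper: builds indicators from the data it is handed,
# not from a df looked up in the calling environment
add_var_inds <- function(data, col, prefix = paste0(col, "_")) {
  lv <- unique(data[[col]])                       # works for character or factor
  inds <- lapply(lv, function(l) as.integer(data[[col]] == l))
  bind_cols(data, as_tibble(setNames(inds, paste0(prefix, lv))))
}

df <- data.frame(id = 1:5,
                 var = c("a", "b", "a", "c", "b"),
                 stringsAsFactors = FALSE)

out <- df %>%
  filter(var != "c") %>%   # an upstream change to var
  add_var_inds("var")      # indicators reflect the filtered rows only
```

The trade-off is one extra function definition in exchange for indicators that always match the state of the data at that point in the chain.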
Upvotes: 3
Reputation: 206496
It's not pretty, but this function should work:
dummy <- function(data, col) {
  for (c in col) {
    idx <- which(names(data) == c)
    v <- data[[idx]]
    stopifnot(is.factor(v))
    # one row per observation, one column per level
    m <- matrix(0, nrow = nrow(data), ncol = nlevels(v))
    # matrix indexing: set cell (i, level code of v[i]) to 1
    m[cbind(seq_along(v), as.integer(v))] <- 1
    colnames(m) <- paste(c, levels(v), sep = "_")
    r <- data.frame(m)
    # re-attach the columns before and after the expanded one
    if (idx > 1) {
      r <- cbind(data[1:(idx - 1)], r)
    }
    if (idx < ncol(data)) {
      r <- cbind(r, data[(idx + 1):ncol(data)])
    }
    data <- r
  }
  data
}
Here's a sample data.frame
dd <- data.frame(a = runif(30),
                 b = sample(letters[1:3], 30, replace = TRUE),
                 c = rnorm(30),
                 d = sample(letters[10:13], 30, replace = TRUE),
                 stringsAsFactors = TRUE # dummy() requires factor columns
)
and you specify the columns you want to expand as a character vector. You can do
dd %>% dummy("b")
or
dd %>% dummy(c("b","d"))
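The core of the function is base R's matrix indexing: assigning through a two-column matrix of (row, column) pairs sets exactly one cell per row. A minimal standalone sketch of that step, using a toy factor `v` made up for this illustration:

```r
v <- factor(c("b", "a", "c", "a"))

# start with all zeros: one row per element, one column per level
m <- matrix(0, nrow = length(v), ncol = nlevels(v),
            dimnames = list(NULL, levels(v)))

# each row of the index matrix is a (row, column) pair;
# as.integer(v) gives the level code, i.e. the column to switch on
m[cbind(seq_along(v), as.integer(v))] <- 1
```

Every row of `m` ends up with a single 1, in the column matching that element's level.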
Upvotes: 5