Tom

Reputation: 714

Creating indicator variable columns in dplyr chain

Updated: With apologies to those who replied, in my original example I overlooked the fact that data.frame() created var as a factor rather than as a character vector, as I had intended. I have corrected the example, and this will break at least one of the answers.

--original--

I have a data frame that I'm performing a series of dplyr and tidyr manipulations on, and I would like to add indicator-variable columns encoded as 0 or 1 within the dplyr chain. Each level of a factor (presently stored as a character vector) should be encoded in a separate column, with the column name formed by concatenating a fixed prefix and the level. For example, if var has level a, the new column var_a should be 1 in rows where var is a and 0 in all other rows.

The following minimal example using base R produces exactly the results that I want (thanks to this blog post), but I'd like to roll it all into the dplyr chain, and can't quite figure out how to do it.

library(dplyr)
df <- data.frame(var = sample(x = letters[1:4], size = 10, replace = TRUE), stringsAsFactors = FALSE)
for(level in unique(df$var)){
  df[paste("var", level, sep = "_")] <- ifelse(df$var == level, 1, 0)
}

Note that the real data set contains multiple columns, none of which should be altered or dropped when creating the indicator variables, with the exception of the column var, which could be converted to type factor.

Upvotes: 8

Views: 5205

Answers (4)

D. Woods

Reputation: 3073

I landed on this Q&A first because I really wanted to put model.matrix in a magrittr pipe workflow or produce the equivalent output with just tidyverse functions (sorry, baseRs).

Later, I found this solution, which uses those functions in the elegant way I thought was possible (but couldn't come up with on my own):

library(dplyr)
library(tidyr)

# tibble() replaces the deprecated data_frame()
df <- tibble(var = sample(x = letters[1:4], size = 10, replace = TRUE))

df %>% 
  mutate(unique_row_id = 1:n()) %>% # The rows need to be unique for `spread` to work.
  mutate(dummy = 1) %>% 
  spread(var, dummy, fill = 0)

So, I'm adding an updated/modified version of the linked solution so that people who land here first don't have to keep looking (like I did).
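The chain above names the new columns after the bare levels and consumes var in the process. A minimal sketch of a variant closer to what the question asks for, assuming tidyr 1.0.0 or later so that pivot_wider is available; the helper columns row_id, level, and dummy are names made up for this illustration:

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- tibble(var = sample(letters[1:4], size = 10, replace = TRUE))

out <- df %>%
  mutate(row_id = row_number(),  # unique key so every row stays separate
         level  = var,           # copy that pivot_wider consumes; var itself survives
         dummy  = 1) %>%
  pivot_wider(names_from = level, values_from = dummy,
              names_prefix = "var_",            # gives var_a, var_b, ...
              values_fill = list(dummy = 0)) %>%
  select(-row_id)                # drop the helper key again

out
```

Copying var into a throwaway column is one way to keep the original column; other columns in a wider data frame are treated as identifiers and pass through unchanged.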

Upvotes: 2

Hong Ooi

Reputation: 57696

The only requirements for a function to be part of a dplyr pipeline are that it takes a data frame as input, and returns a data frame as output. So, leveraging model.matrix:

make_inds <- function(df, cols=names(df))
{
    # do each variable separately to get around model.matrix dropping aliased columns
    # wrap df in list() so that cbind uses its data frame method and returns a data frame
    do.call(cbind, c(list(df), lapply(cols, function(n) {
        x <- df[[n]]
        mm <- model.matrix(~ x - 1)
        colnames(mm) <- gsub("^x", paste(n, "_", sep=""), colnames(mm))
        mm
    })))
}

# insert into pipeline
data %>% ... %>% make_inds %>% ...
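To illustrate the "data frame in, data frame out" requirement on a concrete case, here is a minimal single-column sketch. The helper name add_indicators and the toy data are assumptions made for this example, and it assumes the column has no missing values (model.matrix drops rows with NA by default, which would make the cbind fail):

```r
library(dplyr)

# Hypothetical helper: takes a data frame, returns a data frame
add_indicators <- function(df, col) {
  f <- factor(df[[col]])                        # works for character or factor input
  mm <- model.matrix(~ f - 1)                   # one 0/1 column per level, no intercept
  colnames(mm) <- paste0(col, "_", levels(f))   # e.g. var_a, var_b, ...
  cbind(df, as.data.frame(mm))                  # all original columns are kept
}

df <- data.frame(id = 1:5,
                 var = c("a", "b", "c", "a", "b"),
                 stringsAsFactors = FALSE)

out <- df %>% add_indicators("var")
out
```

Because the helper returns a data frame, further dplyr verbs such as filter or select can follow it in the same chain.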

Upvotes: 3

alistaire

Reputation: 43354

It's possible without creating a function, although it does require lapply. If var is a factor, you can work with its levels: bind_cols attaches the result of an lapply call that loops over the levels of var and creates the indicator values, which are then named with setNames and converted into a tbl_df.

df %>% bind_cols(as_data_frame(setNames(lapply(levels(df$var), 
                                               function(x){as.integer(df$var == x)}), 
                                        paste0('var2_', levels(df$var)))))

returns

Source: local data frame [10 x 5]

      var var_d var_c var2_c var2_d
   (fctr) (dbl) (dbl)  (int)  (int)
1       d     1     0      0      1
2       c     0     1      1      0
3       c     0     1      1      0
4       c     0     1      1      0
5       d     1     0      0      1
6       d     1     0      0      1
7       c     0     1      1      0
8       c     0     1      1      0
9       d     1     0      0      1
10      c     0     1      1      0

If var is a character vector, not a factor, you can do the same thing, but using unique instead of levels:

df %>% bind_cols(as_data_frame(setNames(lapply(unique(df$var), 
                                               function(x){as.integer(df$var == x)}), 
                                        paste0('var2_', unique(df$var)))))

Two notes:

  • This approach will work regardless of the data type, but will be slower. If your data is big enough that it matters, it likely makes sense to store the data as a factor anyway, as it contains a lot of repeated levels.
  • Both versions pull data from df$var as it lives in the calling environment, not as it may exist in a larger chain, and assume var is unchanged in whatever it is passed. To reference the dynamic value of var aside from dplyr's normal NSE is rather a pain, insofar as I've seen.
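One possible workaround for the second note, sketched under the assumption that the chain uses the magrittr pipe: wrapping the step in braces makes `.` refer to the data as it exists at that point in the chain, so the indicators are built from the piped value rather than from df in the calling environment. The toy data and the var_ prefix are chosen for this illustration only:

```r
library(dplyr)

df <- tibble(id = 1:6, var = c("a", "b", "c", "a", "d", "b"))

out <- df %>%
  filter(var != "d") %>%   # var in the chain now differs from df$var
  {
    lv   <- unique(.$var)                                  # levels of the piped data
    inds <- lapply(lv, function(x) as.integer(.$var == x)) # one 0/1 vector per level
    bind_cols(., as_tibble(setNames(inds, paste0("var_", lv))))
  }

out
```

Since the filtered data no longer contains d, no var_d column is created, which is the behaviour the versions that read df$var directly cannot give.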

One more alternative that's a little simpler and factor-agnostic, using reshape2::dcast:

library(reshape2)
df %>% cbind(1 * !is.na(dcast(df, seq_along(var) ~ var, value.var = 'var')[,-1]))

It still pulls the version of df from the calling environment, so the chain really only determines what you're joining to. Because it uses cbind instead of bind_cols, the result will be a data.frame, too, not tbl_df, so if you want to keep it all tbl_df (smart if the data is big), you'll need to replace the cbind with bind_cols(as_data_frame( ... )); bind_cols doesn't seem to want to do the conversion for you.

Note, however, that while this version is simpler, it is comparatively slower, both on factor data:

Unit: microseconds
   expr      min        lq      mean    median       uq      max neval
 factor  358.889  384.0010  479.5746  427.9685  501.580 3995.951   100
 unique  547.249  585.4205  696.4709  633.4215  696.402 4528.099   100
  dcast 2265.517 2490.5955 2721.1118 2628.0730 2824.949 3928.796   100

and string data:

Unit: microseconds
   expr      min       lq      mean    median        uq      max neval
 unique  307.190  336.422  414.1031  362.6485  419.3625 3693.340   100
  dcast 2117.807 2249.077 2517.0417 2402.4285 2615.7290 3793.178   100

For small data it won't matter, but for bigger data, it may be worth putting up with the complication.

Upvotes: 3

MrFlick

Reputation: 206496

It's not pretty, but this function should work:

dummy <- function(data, col) {
    for(c in col) {
        idx <- which(names(data)==c)
        v <- data[[idx]]
        stopifnot(is.factor(v))  # also accepts ordered factors, unlike class(v)=="factor"
        m <- matrix(0, nrow=nrow(data), ncol=nlevels(v))
        m[cbind(seq_along(v), as.integer(v))]<-1
        colnames(m) <- paste(c, levels(v), sep="_")
        r <- data.frame(m)
        if ( idx>1 ) {
            r <- cbind(data[1:(idx-1)],r)
        }
        if ( idx<ncol(data) ) {
            r <- cbind(r, data[(idx+1):ncol(data)])
        }
        data <- r
    }
    data
}

Here's a sample data.frame

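dd <- data.frame(a=runif(30),
    b=sample(letters[1:3],30,replace=TRUE),
    c=rnorm(30),
    d=sample(letters[10:13],30,replace=TRUE),
    stringsAsFactors=TRUE  # dummy() requires factors; R >= 4.0 no longer converts strings by default
)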

and you specify the columns you want to expand as a character vector. You can do

dd %>% dummy("b")

or

dd %>% dummy(c("b","d"))
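The core of the function is the matrix-indexing step, which may be easier to follow in isolation. A small sketch on a toy factor (the vector v and the v_ prefix are made up for this illustration): indexing a matrix with a two-column matrix of (row, column) pairs sets exactly one cell per row.

```r
v <- factor(c("b", "a", "c", "a"))

# Start with all zeros: one row per observation, one column per level
m <- matrix(0, nrow = length(v), ncol = nlevels(v),
            dimnames = list(NULL, paste("v", levels(v), sep = "_")))

# Each row of the index matrix is a (row, column) pair;
# as.integer(v) gives the position of each value among the levels
m[cbind(seq_along(v), as.integer(v))] <- 1

m
```

Every row ends up with a single 1, in the column of its level, without any explicit loop over levels.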

Upvotes: 5
