Reputation: 1427
I feel like this should be easier.
Assume I have a field that contains a "multivalued" item, e.g. genres of a movie.
I want to break those out into dummy (indicator) columns, so that a row with more than one item gets a 1 in each of the corresponding columns.
How do I do that in a nice, convenient way?
library(tidyverse)
data <- tribble(
~column,
"var1",
"var1 / var2",
"var2",
"var3",
"var1 / var3",
"var2 / var3"
)
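data %>%
  # split each entry into at most two items; single-item rows get NA in item2
  separate(column, into = c("item1", "item2"), sep = " / ", fill = "right") %>%
  # convert both item columns to factors that share the same set of levels
  mutate(across(everything(), ~ factor(.x, levels = c("var1", "var2", "var3")))) %>%
  # keep a row id so the two cross-tabulations below line up
  mutate(row = as.factor(row_number())) ->
  intermediate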
head(intermediate)
#> # A tibble: 6 × 3
#> item1 item2 row
#> <fctr> <fctr> <fctr>
#> 1 var1 NA 1
#> 2 var1 var2 2
#> 3 var2 NA 3
#> 4 var3 NA 4
#> 5 var1 var3 5
#> 6 var2 var3 6
v1 <- xtabs( ~ row + item1, data = intermediate)
v2 <- xtabs( ~ row + item2, data = intermediate)
combined <- v1 + v2
combined
#> item1
#> row var1 var2 var3
#> 1 1 0 0
#> 2 1 1 0
#> 3 0 1 0
#> 4 0 0 1
#> 5 1 0 1
#> 6 0 1 1
That feels really un-R-like.
This is pretty easy to do in Python with sklearn's DictVectorizer. For instance:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
d = [
"var1",
"var1 / var2",
"var2",
"var3",
"var1 / var3",
"var2 / var3"
]
data = pd.DataFrame(d, columns = ["column"])
col = data.column.str.split(" / ")
col = col.apply(lambda row: {key: 1 for key in row})
transformer = DictVectorizer()
transformer.fit_transform(col).todense()
#> matrix([[ 1., 0., 0.],
#> [ 1., 1., 0.],
#> [ 0., 1., 0.],
#> [ 0., 0., 1.],
#> [ 1., 0., 1.],
#> [ 0., 1., 1.]])
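To spell out what that transformation does, here is a minimal sketch of the same idea using only the Python standard library. The helper name `indicator_matrix` is hypothetical (it is not part of pandas or scikit-learn), and sorting the labels alphabetically is an assumption made so that the column order is predictable.

```python
def indicator_matrix(values, sep=" / "):
    """Return (labels, rows): the sorted distinct items and one 0/1 row per input string.

    Hypothetical helper for illustration only.
    """
    # Split each multivalued string into its individual items.
    split = [v.split(sep) for v in values]
    # Collect every distinct item; sorted so the column order is deterministic.
    labels = sorted({item for items in split for item in items})
    # One row per input, one column per label: 1 if the label occurs in that row.
    rows = [[1 if label in items else 0 for label in labels] for items in split]
    return labels, rows


d = [
    "var1",
    "var1 / var2",
    "var2",
    "var3",
    "var1 / var3",
    "var2 / var3",
]

labels, rows = indicator_matrix(d)
print(labels)
#> ['var1', 'var2', 'var3']
for r in rows:
    print(r)
#> [1, 0, 0]
#> [1, 1, 0]
#> [0, 1, 0]
#> [0, 0, 1]
#> [1, 0, 1]
#> [0, 1, 1]
```

Unlike the `separate()` approach above, this does not need to know in advance how many items a row can hold.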
I'm really just looking for a "tidy" equivalent in R-land.
Upvotes: 2
Views: 301
Reputation: 14202
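You can use the splitstackshape package. Note that charMat is reached via `:::` here, so it appears to be an internal helper and may change between versions: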
x <- c("var1",
"var1 / var2",
"var2",
"var3",
"var1 / var3",
"var2 / var3"
)
library(splitstackshape)
splitstackshape:::charMat(strsplit(x, " / "), 0)
var1 var2 var3
[1,] 1 0 0
[2,] 1 1 0
[3,] 0 1 0
[4,] 0 0 1
[5,] 1 0 1
[6,] 0 1 1
Upvotes: 3