Reputation: 345
I am learning arule, I need to convert the current data into as.matrix
I tried to take the project apart and then present 0 and 1
library(data.table)
DT <- data.table(ID=c("dog","dog","dog","cat","cat","bird"),
place=c("F-A-C","A-B-E","H-A","A-I-C-D","B-A","D-K-H-F"),
stringsAsFactors = FALSE)
I found this method, but not the result I want
library(stringr)
library(plyr)
DTa <- str_split(DT$place, "-")
DTa <- ldply(DTa ,rbind)
DT <- cbind(DT$ID, DTa)
output:
DT$ID 1 2 3 4
1 dog F A C NA
2 dog A B E NA
3 dog H A NA NA
4 cat A I C D
5 cat B A NA NA
6 bird D K H F
I hope the result is like this
DT$ID A B C D E F G H I ..... K
1 dog 1 1 1 0 1 1 0 1 0 ..... 0
2 cat 1 1 1 1 0 0 0 0 1 ..... 0
3 bird 0 0 0 1 0 1 0 1 0 ..... 1
In the original data, there may be A-I, or A-Z, or A-Q, not sure how many
And the ID is not sure how many will be there.
So I can't set the length through this.
str_split_fixed(DT$place, "-", 11)
What should I do or find out what keywords I want to do?
Thank you
Upvotes: 2
Views: 88
Reputation: 887511
An easier method would be to use cSplit
from splitstackshape
to split into 'long' format and then do a dcast
to reshape into 'wide' format while specifying thee fun.aggregate
to a logical condition based on the length
library(splitstackshape)
library(data.table)
dcast(cSplit(DT, "place", "-", 'long'),
ID ~ place, function(x) as.integer(length(x) > 0))
Or as @Frank suggested
dcast(unique(cSplit(DT, "place", "-", 'long'))[, v := 1], ID ~ place, fill=0)
or in tidyverse
, split the column with separate_rows
, get the distinct
rows, create a column of 1's and spread
to 'wide' format
library(dplyr)
library(tidyr)
DT %>%
separate_rows(place) %>%
distinct(ID, place) %>%
mutate(n = 1) %>%
spread(place, n, fill = 0)
Or in base R
, this can be done by splitting the 'place' column into a list
of vector
s, get the table
of the stack
ed list
+(table(stack(setNames(strsplit(DT$place, "-"), DT$ID))[2:1]) > 0)
Upvotes: 3
Reputation: 160607
A data.table
-only solution:
dcast(DT[, unlist(lapply(.SD, strsplit, "-")), "ID"], ID ~ V1, value.var = "V1", fun.aggregate = length)
# ID A B C D E F H I K
# 1: bird 0 0 0 1 0 1 1 0 1
# 2: cat 2 1 1 1 0 0 0 1 0
# 3: dog 3 1 1 0 1 1 1 0 0
This provides "length" instead of "yes/no". To take it to that level:
dcast(DT[, unlist(lapply(.SD, strsplit, "-")), "ID"], ID ~ V1, value.var = "V1", fun.aggregate = length)[, lapply(.SD, min, 1), by = "ID"]
# ID A B C D E F H I K
# 1: bird 0 0 0 1 0 1 1 0 1
# 2: cat 1 1 1 1 0 0 0 1 0
# 3: dog 1 1 1 0 1 1 1 0 0
I find it a little easier to see using magrittr
's pipe:
library(magrittr)
DT[, unlist(lapply(.SD, strsplit, "-")), "ID"] %>%
dcast(ID ~ V1, value.var = "V1", fun.aggregate = length) %>%
.[, lapply(.SD, min, 1), by = "ID"]
Upvotes: 2