Reputation: 547
n = 1:5
lett = LETTERS[1:5]
value = character(length = 5)
size = numeric(length = 5)
for (i in 1:5) {
  set.seed(i)
  size[i] = sample(1:5, 1)
  set.seed(i)
  value[i] = paste(sample(lett, size[i]), collapse = ";")
}
dat = data.frame(n, value)
dat
> dat
  n value
1 1   B;E
2 2     A
3 3     A
4 4 C;A;D
5 5   B;C
The data.frame is shown above. I want to reshape it into the following format, with one Yes/No column per category:
n  A   B  C  D   E
1 No Yes No No Yes
2 ...
3 ...
4 ...
5 ...
How can I do this? (Assume the real data has more than 5 categories in value, and that I do not know how many categories there are before cleaning the data.)
Upvotes: 1
Views: 76
Reputation: 887028
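We can split the 'value' column on ";", use mtabulate (from qdapTools) to get the frequency of each unique element in every row, convert the result to a numeric index matrix (1 = absent, 2 = present), and use that index to replace the values with 'No' and 'Yes'. The categories are taken from the data, so they do not need to be known in advance.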
library(qdapTools)
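# split on ";", tabulate each row, then turn counts into an index: 1 = absent, 2 = present
m1 <- (mtabulate(strsplit(as.character(dat$value), ";")) != 0) + 1
# replace the index with labels while keeping the matrix dimensions and column names
m1[] <- c("No", "Yes")[m1]
# carry over the original n column instead of assuming it is 1:nrow
data.frame(n = dat$n, m1)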
#  n   A   B   C   D   E
#1 1  No Yes  No  No Yes
#2 2 Yes  No  No  No  No
#3 3 Yes  No  No  No  No
#4 4 Yes  No Yes Yes  No
#5 5  No Yes Yes  No  No
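If installing qdapTools is not an option, the same split, tabulate and relabel steps can be sketched in base R. This is only an illustration, not part of the original answer: the data is rebuilt by hand from the output printed in the question, and the names parts, cats, present, flags and res are made up for the example.

```r
# Rebuild the example data (values copied from the printed output in the question)
dat <- data.frame(n = 1:5,
                  value = c("B;E", "A", "A", "C;A;D", "B;C"),
                  stringsAsFactors = FALSE)

# Split each entry on ";" into a list of character vectors
parts <- strsplit(as.character(dat$value), ";", fixed = TRUE)

# Discover the categories from the data itself, so the count need not be known up front
cats <- sort(unique(unlist(parts)))

# For each row, test which categories are present.
# vapply returns one column per row of dat, so transpose to one row per row of dat.
present <- t(vapply(parts, function(p) cats %in% p, logical(length(cats))))

# Map FALSE/TRUE to "No"/"Yes" and restore the matrix shape and column names
flags <- matrix(ifelse(present, "Yes", "No"), nrow = nrow(dat),
                dimnames = list(NULL, cats))

res <- data.frame(n = dat$n, flags, stringsAsFactors = FALSE)
res
```

On this example data, res should contain the same Yes/No table as the mtabulate version above. Any category that appears somewhere in value gets its own column.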
Upvotes: 2