Rajarshi Bhadra
Rajarshi Bhadra

Reputation: 1944

Frequency of Characters in Strings as columns in data frame using R

I have a data frame initial of the following format

> head(initial)
      Strings
1     A,A,B,C
2       A,B,C
3 A,A,A,A,A,B
4     A,A,B,C
5       A,B,C
6 A,A,A,A,A,B

and the data frame I want is final

    > head(final)
      Strings A B C
1     A,A,B,C 2 1 1
2       A,B,C 1 1 1
3 A,A,A,A,A,B 5 1 0
4     A,A,B,C 2 1 1
5       A,B,C 1 1 1
6 A,A,A,A,A,B 5 1 0

to generate the data frames the following codes can be used to keep the number of rows high

initial<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100))
final<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100),A=rep(c(2,1,5),100),B=rep(c(1,1,1),100),C=rep(c(1,1,0),100))

What is the fastest way I can achieve this? Any help will be greatly appreciated

Upvotes: 4

Views: 483

Answers (1)

akrun
akrun

Reputation: 887098

We can use base R methods for this task. We split the 'Strings' column (strsplit(...)), set the names of the output list with the sequence of rows, stack to convert to data.frame with key/value columns, get the frequency with table, convert to 'data.frame' and cbind with the original dataset.

 cbind(df1, as.data.frame.matrix(
                  table(
                    stack(
                     setNames(
                       strsplit(as.character(df1$Strings),','), 1:nrow(df1))
                           )[2:1])))
 #          Strings A B C D
 #1         A,B,C,D 1 1 1 1
 #2     A,B,B,D,D,D 1 2 0 3
 #3 A,A,A,A,B,C,D,D 4 1 1 2

or we can use mtabulate after splitting the column.

library(qdapTools)
cbind(df1, mtabulate(strsplit(as.character(df1$Strings), ',')))
#          Strings A B C D
#1         A,B,C,D 1 1 1 1
#2     A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2

Update

For the new dataset 'initial', the second method works. If we need to use the first method with the correct order, convert to factor class with levels specified as the unique elements of 'ind'.

df1 <- stack(setNames(strsplit(as.character(initial$Strings), ','),
          seq_len(nrow(initial))))
df1$ind <- factor(df1$ind, levels=unique(df1$ind))
cbind(initial, as.data.frame.matrix(table(df1[2:1])))

Upvotes: 5

Related Questions