rdatasculptor

Reputation: 8447

R Text mining - how to change texts in R data frame column into several columns with word frequencies?

I have a data frame with 4 columns. Column 1 contains IDs, column 2 contains texts (about 100 words each), and columns 3 and 4 contain labels.

Now I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves and the columns filled with their frequencies (ranging from 0 to ... per text) in the texts.

I tried some functions of the tm package, but so far the results have been unsatisfactory. Does anyone have an idea how to deal with this problem or where to start? Is there a package that can do the job?

id  texts   label1    label2

Upvotes: 2

Views: 4122

Answers (1)

Tyler Rinker

Reputation: 109924

Well let's work through the issues then...

I'm guessing you have a data.frame that looks like this:

       person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

This data set comes from the qdap package. To get qdap, use install.packages("qdap").

Now, to make the reproducible example I was talking about from your own data set, do what I'm doing here with the DATA data set from qdap:

library(qdap)     # DATA and wfm come from qdap
DATA
dput(head(DATA))
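
In case it helps to see why dput is the thing to paste into a question, here is a small base R sketch. The data frame `small` and its contents are made up for illustration; the point is only that the text dput prints can be evaluated to rebuild the same object, so others can copy it straight into their session.

```r
# Hypothetical miniature of the asker's data (names and values are invented)
small <- data.frame(id = 1:3,
                    texts = c("first text", "second text", "third text"),
                    stringsAsFactors = FALSE)

# dput() prints R code that rebuilds the object; capture that code as a string
code <- paste(capture.output(dput(small)), collapse = "\n")

# Evaluating the captured code recreates the data frame
rebuilt <- eval(parse(text = code))
identical(rebuilt, small)
```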

Ok now for your original problem I think wfm will do what you want:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
data.frame(DATA, freqs, check.names = FALSE)

If you only want the top n words, sort the column totals and keep the first n, as I do here for the top 9:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
ords <- rev(sort(colSums(freqs)))[1:9]      #top 9 words
top9 <- freqs[, names(ords)]                #grab those columns from freqs  
data.frame(DATA, top9, check.names = FALSE) #put it together

The outcome looks like this:

> data.frame(DATA, top9, check.names = FALSE)
       person sex adult                                 state code you we what not no it's is i fun
1         sam   m     0         Computer is fun. Not too fun.   K1   0  0    0   1  0    0  1 0   2
2        greg   m     0               No it's not, it's dumb.   K2   0  0    0   1  1    2  0 0   0
3     teacher   m     1                    What should we do?   K3   0  1    1   0  0    0  0 0   0
4         sam   m     0                  You liar, it stinks!   K4   1  0    0   0  0    0  0 0   0
5        greg   m     0               I am telling the truth!   K5   0  0    0   0  0    0  0 1   0
6       sally   f     0                How can we be certain?   K6   0  1    0   0  0    0  0 0   0
7        greg   m     0                      There is no way.   K7   0  0    0   0  1    0  1 0   0
8         sam   m     0                       I distrust you.   K8   1  0    0   0  0    0  0 1   0
9       sally   f     0           What are you talking about?   K9   1  0    1   0  0    0  0 0   0
10 researcher   f     1         Shall we move on?  Good then.  K10   0  1    0   0  0    0  0 0   0
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11   1  0    0   0  0    0  0 0   0
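
If installing qdap is not an option, roughly the same table can be built with base R alone. This is only a sketch under my own assumptions, not what wfm does internally: the data frame `df`, the `tokenize` helper and the cut-off `n` are all hypothetical, and the tokenizer is deliberately crude (it lower-cases the text and keeps only a-z, apostrophes and spaces, so accented letters and digits are dropped).

```r
# Hypothetical stand-in for the asker's data frame (id, texts, label1, label2)
df <- data.frame(
  id     = 1:3,
  texts  = c("Computer is fun. Not too fun.",
             "No it's not, it's dumb.",
             "There is no way."),
  label1 = c("a", "b", "a"),
  label2 = c("x", "x", "y"),
  stringsAsFactors = FALSE
)

# Crude tokenizer (an assumption, not qdap's): lower-case, replace anything
# that is not a-z, an apostrophe or a space, then split on whitespace
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  words <- strsplit(x, "\\s+")[[1]]
  words[nzchar(words)]
}

tokens <- lapply(df$texts, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every text: one row per text, one column per word
freqs <- t(vapply(tokens,
                  function(w) as.integer(table(factor(w, levels = vocab))),
                  integer(length(vocab))))
colnames(freqs) <- vocab

# Keep the n most frequent words overall and bind them to the original columns
n      <- 5
top    <- names(sort(colSums(freqs), decreasing = TRUE))[seq_len(min(n, ncol(freqs)))]
result <- data.frame(df, freqs[, top, drop = FALSE], check.names = FALSE)
result
```

As in the qdap version, check.names = FALSE is what keeps a column name like it's from being mangled into a syntactically valid name.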

Upvotes: 7
