SamFlynn

Reputation: 379

Extract total frequency of words from vector in R

This is the vector I have:

 posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players.  they have private message boards where it appears most of their work goes on.  i would bet they are posting more there than in jita speakers corner.  i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold.  its sort of like ccp used to post here on the forums then they stopped.  so they got a csm to represent players and use jita park forum to interact.  now the csm no longer posts there as they have their internal forums where they hash things out.  perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")

I want a data frame as a result, containing each word and the number of times it occurs.

So result should look something like:

word   count
a        300
and      260
be       200
...      ...
...      ...

What I tried to do was use tm:

library(tm)

corpus <- VCorpus(VectorSource(posts))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
m <- DocumentTermMatrix(corpus)

Running findFreqTerms(m, lowfreq = 0, highfreq = Inf) just gives me the words. I understand m is a sparse matrix, but how do I extract the words together with their frequencies?
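My best guess is to convert the sparse matrix to a plain one and total each term's column, something like the sketch below, but I don't know whether that is a sensible approach:

# my guess: densify the DocumentTermMatrix and sum each term column
freq <- colSums(as.matrix(m))
head(sort(freq, decreasing = TRUE))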

Is there an easier way to do this, maybe without using tm at all?

Upvotes: 4

Views: 5482

Answers (5)

LMc

Reputation: 18732

termFreq will return a named vector (names are words and values are word counts):

library(tm)

txt <- PlainTextDocument(VectorSource(posts))
termFreq(txt, control = list(tolower = T, removeNumbers = T, removePunctuation = T))
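If you specifically want the word/count data frame from the question, you could convert that named vector yourself, for example (tf below is just an illustrative name for the stored result):

# tf: illustrative name for the termFreq result above
tf <- termFreq(txt, control = list(tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE))
data.frame(word = names(tf), count = as.integer(tf))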

Or using the qdap package, which will return a data frame:

qdap::freq_terms(posts, top = Inf)

Upvotes: 0

msubbaiah

Reputation: 350

You've got two options, depending on whether you want word counts per document or across all documents.

All Documents

library(dplyr)

# terms as rows, one column of counts per document
count <- as.data.frame(t(as.matrix(m)))
sel_cols <- colnames(count)
count$word <- rownames(count)
rownames(count) <- seq_len(nrow(count))
# total each word across all documents
count$count <- rowSums(count[, sel_cols])
count <- count %>% select(word, count)
count <- count[order(count$count, decreasing = TRUE), ]

### RESULT of head(count)

#     word count
# 140  the    14
# 144 they    10
# 4    and     9
# 25   csm     7
# 43   for     5
# 55   had     4

This captures occurrences across all documents (via rowSums).
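Alternatively, since the DocumentTermMatrix is stored as a sparse slam matrix, slam::col_sums should total each term without building the dense matrix. A quick sketch, assuming m is the matrix from the question:

# total each term directly on the sparse matrix, then sort
freq <- sort(slam::col_sums(m), decreasing = TRUE)
data.frame(word = names(freq), count = as.integer(freq))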

Per Document

I would suggest using the tidytext package if you want word frequencies per document.

library(tidytext)
m_td <- tidy(m)
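tidy(m) should give one row per document/term pair with a count column, so it can be sorted or aggregated with dplyr, for example:

library(dplyr)

# most frequent terms within each document
m_td %>% arrange(document, desc(count))

# or collapse across documents to get one total per term
m_td %>% count(term, wt = count, sort = TRUE)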

Upvotes: 2

alistaire

Reputation: 43354

The tidytext package allows fairly intuitive text mining, including tokenization. It is designed to work in a tidyverse pipeline, so it supplies a list of stop words ("a", "the", "to", etc.) to exclude with dplyr::anti_join. Here, you might do

library(dplyr)    # or if you want it all, `library(tidyverse)`
library(tidytext)

data_frame(posts) %>% 
    unnest_tokens(word, posts) %>% 
    anti_join(stop_words) %>% 
    count(word, sort = TRUE)

## # A tibble: 101 × 2
##        word     n
##       <chr> <int>
## 1       csm     7
## 2       0.0     3
## 3       nda     3
## 4       bit     2
## 5       ccp     2
## 6  dominion     2
## 7     forum     2
## 8    forums     2
## 9      hard     2
## 10 internal     2
## # ... with 91 more rows
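Note that the expected output in the question keeps common words like "a" and "and"; if you want those counted as well, you can simply skip the anti_join step:

# same pipeline, but keeping stop words
data_frame(posts) %>% 
    unnest_tokens(word, posts) %>% 
    count(word, sort = TRUE)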

Upvotes: 1

Sathish

Reputation: 12723

posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players.  they have private message boards where it appears most of their work goes on.  i would bet they are posting more there than in jita speakers corner.  i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold.  its sort of like ccp used to post here on the forums then they stopped.  so they got a csm to represent players and use jita park forum to interact.  now the csm no longer posts there as they have their internal forums where they hash things out.  perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
posts <- gsub("[[:punct:]]", '', posts)  # remove punctuations
posts <- gsub("[[:digit:]]", '', posts)  # remove numbers
word_counts <- as.data.frame(table(unlist( strsplit(posts, "\ ") )))  # split vector by space
word_counts <- with(word_counts, word_counts[ Var1 != "", ] )  # remove empty characters
head(word_counts)
#       Var1 Freq
# 2        a    8
# 3    about    3
# 4   allows    1
# 5 although    1
# 6       am    1
# 7       an    1
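If you want the word/count column names and the ordering shown in the question, a small follow-up tweak (just a sketch) is:

# rename Var1/Freq and sort with the most frequent words first
names(word_counts) <- c("word", "count")
word_counts <- word_counts[order(word_counts$count, decreasing = TRUE), ]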

Upvotes: 7

setempler

Reputation: 1751

A plain R solution, assuming all words are separated by spaces:

words <- strsplit(posts, " ", fixed = TRUE)  # split each post on spaces
words <- unlist(words)                       # flatten into one character vector
counts <- table(words)                       # tabulate word frequencies

names(counts) holds the words, and the values are the counts.
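To get the word/count data frame asked for in the question, one option (a sketch) is to coerce the table:

# coerce the table into a data frame with the requested column names
setNames(as.data.frame(counts, stringsAsFactors = FALSE), c("word", "count"))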

You might also want to use gsub to get rid of (),.?: and of 's, 't or 're, as in your example. For instance:

posts <- gsub("'t|'s|'t|'re", "", posts)
posts <- gsub("[(),.?:]", " ", posts)

Upvotes: 5
