LNA

Reputation: 1447

Count word occurrences in R

Is there a function for counting the number of times a particular keyword is contained in a dataset?

For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.

Upvotes: 33

Views: 120508

Answers (6)

ypa y yhm

Reputation: 219

You can do this without any library beyond the pipe operator from magrittr:

word count

library(magrittr); 

#' @author y.ypa.yhm
#' @license agpl-3.0
#' 

`%names%` = function (k,v) `names<-` (v,k) ;

wordcount = `%wc%` = 
function (texts, words) texts %names% texts %>% 
    strsplit (" ") %>% 
    lapply (\ (words) words %>% 
        split (words) %>% 
        lapply (length) ) %>% 
    {if (length(words) == 0) . else lapply (., \ (counts) 
        counts[words] )} ;

#' @examples 
#' 
#' `c("aaa bbb CCC ddd bbb CC", "bb CC eee 1 bb CCC PPP") %wc% c("eee","PPP")`
#' 
## $`aaa bbb CCC ddd bbb CC`
## $`aaa bbb CCC ddd bbb CC`$<NA>
## NULL
## 
## $`aaa bbb CCC ddd bbb CC`$<NA>
## NULL
## 
## 
## $`bb CC eee 1 bb CCC PPP`
## $`bb CC eee 1 bb CCC PPP`$eee
## [1] 1
## 
## $`bb CC eee 1 bb CCC PPP`$PPP
## [1] 1
## 
## 

use:

c("corn"
, "cornmeal"
, "corn on the cob"
, "meal") %wc% "corn" %>% 
    
    unlist %>% 
    sum (na.rm = T)

This returns 2. To see the per-element detail, just run dataset %wc% "corn".
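For comparison, the same whole-word count can be sketched in plain base R. This is only a minimal sketch, not the method above: it assumes words are separated by single spaces, and the helper name count_word is hypothetical.

```r
# Hypothetical helper: count how many space-separated tokens equal `word`.
# Assumes words are separated by single spaces (no punctuation handling).
count_word <- function(texts, word) {
  tokens <- unlist(strsplit(texts, " ", fixed = TRUE))
  sum(tokens == word)
}

dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
count_word(dataset, "corn")
# [1] 2
```

"cornmeal" is not counted because the comparison is against whole tokens, not substrings.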

substring count

library(magrittr); 

#' @author y.ypa.yhm
#' @license agpl-3.0
#' 

char.apart = 
function (str) str %>% nchar %>% {.+1} %>% seq %>% sample(1) %>% intToUtf8 %>% 
    {if (! (. %in% strsplit(str,"")[[1]])) . else char.apart (str)} ;

`%names%` = function (k,v) `names<-` (v,k) ;

strsubcnt = `%strsubcnt%` = 
function (strs, subs) subs %names% subs %>% 
    lapply (\ (sub) (\ (rchar) strs %>% 
        paste0 (rchar) %>% 
        `names<-` (strs) %>% 
        strsplit (sub) %>% 
        lapply (length) %>% 
        lapply (\(a) a - 1) 
    ) (sub %>% char.apart) ) ;

#' @examples
#' 
#' `c("acda","bc db","xcdye") %strsubcnt% c("c"," db","ye")`
#' 
#' outs: 
#' 
## $c
## $c$acda
## [1] 1
## 
## $c$`bc db`
## [1] 1
## 
## $c$xcdye
## [1] 1
## 
## 
## $` db`
## $` db`$acda
## [1] 0
## 
## $` db`$`bc db`
## [1] 1
## 
## $` db`$xcdye
## [1] 0
## 
##
## $ye
## $ye$acda
## [1] 0
## 
## $ye$`bc db`
## [1] 0
## 
## $ye$xcdye
## [1] 1
## 
## 

It is meant to count literal substrings only and is not intended for regex patterns, so it may not be able to do things like count whole words.

use:

c("corn"
, "cornmeal"
, "corn on the cob"
, "meal") %strsubcnt% "corn" %>% 
    
    unlist %>% 
    sum (na.rm = T)

This returns 3. Again, to see the detail, just run dataset %strsubcnt% "corn".


Tested on webR REPL app
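A literal substring count can also be sketched with base R's gregexpr and regmatches. This is an alternative technique, not the operator defined above; the helper name count_literal is hypothetical, and fixed = TRUE is used on the assumption that the pattern should never be read as a regex.

```r
# Hypothetical helper: count non-overlapping literal occurrences of `sub`
# in each element of `strs`. fixed = TRUE disables regex interpretation.
count_literal <- function(strs, sub) {
  matches <- gregexpr(sub, strs, fixed = TRUE)
  lengths(regmatches(strs, matches))
}

dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
count_literal(dataset, "corn")
# [1] 1 1 1 0
sum(count_literal(dataset, "corn"))
# [1] 3

# A pattern containing a regex metacharacter is matched literally:
count_literal(c("a.b a.b", "axb"), "a.b")
# [1] 2 0
```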

Upvotes: 0

Nadir Latif

Reputation: 3763

You can use the str_count function from the stringr package to count how many times a keyword occurs in each element of a character vector.

The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.

The regular expression syntax is very flexible and allows matching whole words as well as character patterns.

For example, the following code counts all occurrences of the string "corn" and returns 3:

library(stringr)
sum(str_count(dataset, regex("corn")))

To match complete words use:

sum(str_count(dataset, regex("\\bcorn\\b")))

The "\b" specifies a word boundary. By default, str_count treats an apostrophe as a non-word character, so a word boundary falls between "corn" and "'s". If your dataset contains the string "corn's", it is therefore matched and included in the result.

To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = TRUE. This makes the regular expression engine use the Unicode TR 29 definition of word boundaries (see http://unicode.org/reports/tr29/tr29-4.html), which does not treat an apostrophe inside a word as a word boundary.

The following code gives the number of times the word "corn" occurs. Words such as "corn's" are not included.

sum(str_count(dataset, regex("\\bcorn\\b", uword = TRUE)))
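To see the difference on concrete input, the two calls can be run side by side. This is a small sketch; the vector words is illustrative data that is not part of the original question.

```r
library(stringr)

# Illustrative data (an assumption, not from the question): one possessive form.
words <- c("corn", "corn's", "cornmeal")

# Default \b: the apostrophe is a non-word character, so "corn's" matches.
sum(str_count(words, regex("\\bcorn\\b")))
# [1] 2

# uword = TRUE: boundaries follow Unicode TR 29. Compare this count with the
# one above to check whether "corn's" is still included.
sum(str_count(words, regex("\\bcorn\\b", uword = TRUE)))
```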

Upvotes: 1

Benbob

Reputation: 378

I'd just do it with string division like:

library(roperators)

dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# for each vector element:
dataset %s/% 'corn'

# for everything:
sum(dataset %s/% 'corn') 

Upvotes: 1

Junaid

Reputation: 3945

You can also do something like the following, which counts the elements that are exactly equal to "corn" (1 for the example dataset):

length(dataset[which(dataset=="corn")])
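A shorter way to get the same exact-match count is to sum the logical comparison directly. This is a sketch of an equivalent form for the example dataset from the question, not a replacement for the line above.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# The comparison gives a logical vector; summing it counts the TRUE values.
sum(dataset == "corn")
# [1] 1

# Same result as the indexing version
length(dataset[which(dataset == "corn")])
# [1] 1
```

If the data may contain NA values, sum(dataset == "corn", na.rm = TRUE) avoids an NA result, whereas which() already drops them.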

Upvotes: 2

petermeissner

Reputation: 12861

Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:

library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0

# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0

# summing it up
sum(str_count(dataset, "corn"))
# [1] 3

Upvotes: 35

IRTFM

Reputation: 263301

Let's for the moment assume you wanted the number of elements containing "corn":

length(grep("corn", dataset))
[1] 3

After you get the basics of R down better you may want to look at the "tm" package.

EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:

grep("\\<corn\\>", dataset)
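The two patterns can be compared as counts with grepl, which returns one TRUE/FALSE per element. This is a minimal sketch using the example dataset from the question; the last call uses made-up input to show that grep counts elements, not occurrences.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Elements containing "corn" anywhere
sum(grepl("corn", dataset))
# [1] 3

# Elements containing "corn" as a whole word
sum(grepl("\\<corn\\>", dataset))
# [1] 2

# An element with two occurrences still counts once
length(grep("corn", c("corn and more corn", "meal")))
# [1] 1
```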

Upvotes: 48
