Ben

Reputation: 525

Count how often words from a vector occur in a string

I have a string of text and a vector of words:

String: "Auch ein blindes Huhn findet einmal ein Korn."
Vector: "auch", "ein"

I want to check how often each word in the vector is contained in the string and calculate the sum of the frequencies. For the example, the correct result would be 3.

So far I can check which of the words occur in the string and sum the results:

library(stringr)
deu <- c("\\bauch\\b", "\\bein\\b")
str_detect(tolower("Auch ein blindes Huhn findet einmal ein Korn."), deu)

[1] TRUE TRUE

sum(str_detect(tolower("Auch ein blindes Huhn findet einmal ein Korn."), deu))

[1] 2

Unfortunately, str_detect does not return the number of occurrences (1, 2), only whether each word occurs in the string (TRUE, TRUE), so the sum of its output is the number of distinct words found, not the total number of matches.

Is there a function in R similar to preg_match_all in PHP?

preg_match_all("/\bauch\b|\bein\b/i", "Auch ein blindes Huhn findet einmal ein Korn.", $matches);
print_r($matches);

Array
(
    [0] => Array
        (
            [0] => Auch
            [1] => ein
            [2] => ein
        )

)

echo preg_match_all("/\bauch\b|\bein\b/i", "Auch ein blindes Huhn findet einmal ein Korn.", $matches);

3

I would like to avoid loops.


I have looked at a lot of similar questions, but they either don't count the number of occurrences or do not use a vector of patterns to search. I may have overlooked a question that answers mine, but before you mark this as duplicate, please make sure that the "duplicate" actually asks the exact same thing. Thank you.

Upvotes: 3

Views: 83

Answers (4)

Friede

Reputation: 7979

Character String Processing

If the base R syntax feels too complex, I would go with {stringi}:

stringi::stri_count_regex(tolower(String), sprintf('\\b%s\\b', Vector)) |> 
  setNames(Vector) # optional
auch  ein 
   1    2 

Data

String = 'Auch ein blindes Huhn findet einmal ein Korn.'
Vector = c('auch', 'ein')

Upvotes: 2

ThomasIsCoding

Reputation: 102529

Given string and pattern like below

s <- "Auch ein blindes Huhn findet einmal ein Korn."
p <- c("auch", "ein")

you can try strsplit + %in%:

  • Option 1 (to get the sum of occurrences)
> sum(gsub("\\W", "", strsplit(tolower(s), " ")[[1]]) %in% p)
[1] 3
  • Option 2 (use table if you would like to see the summary of counts)
> table(gsub("\\W", "", strsplit(tolower(s), " ")[[1]]))[p]

auch  ein
   1    2
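
One caveat with indexing the table by p: a word from p that never occurs in the string has no entry in the table, so the lookup yields NA rather than 0. A minimal sketch of one way around this, assuming it is acceptable to convert the tokens to a factor whose levels are exactly the words of interest ("katze" is a hypothetical extra word added only to show the zero case):

```r
s <- "Auch ein blindes Huhn findet einmal ein Korn."
p <- c("auch", "ein", "katze")  # "katze" is hypothetical and does not occur in s

# Split on spaces, lower-case, and strip non-word characters, as in the answer
tokens <- gsub("\\W", "", strsplit(tolower(s), " ")[[1]])

# Restricting the factor levels to p makes table() report 0 for absent words;
# tokens that are not in p become NA and are dropped by table() by default
counts <- table(factor(tokens, levels = p))
counts

# Total number of occurrences across all words in p
sum(counts)
```

Because the levels are fixed up front, the result always has one entry per word in p, in the same order.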

Upvotes: 2

Tim G

Reputation: 4147

You can use str_count, which returns the number of matches per pattern, for example:

stringr::str_count(tolower("Auch ein blindes Huhn findet mal ein Korn"), paste0("\\b", tolower(c("ein","Huhn")), "\\b"))
[1] 2 1
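
Applied to the string and words from the question, the same idea might look like the sketch below. It assumes stringr is installed and uses regex(ignore_case = TRUE) in place of tolower(); the variable names are illustrative only.

```r
library(stringr)

s <- "Auch ein blindes Huhn findet einmal ein Korn."
words <- c("auch", "ein")

# Wrap each word in word boundaries so that "ein" does not match inside "einmal";
# ignore_case = TRUE makes lower-casing the input unnecessary
patterns <- regex(paste0("\\b", words, "\\b"), ignore_case = TRUE)

# str_count() is vectorised over the patterns: one count per word
counts <- setNames(str_count(s, patterns), words)
counts

# Sum of the frequencies, which is what the question asks for
total <- sum(counts)
total
```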

Upvotes: 5

jay.sf

Reputation: 73562

You could build the patterns with sprintf, wrapping each word in \\b word boundaries, and then use lengths on the result of gregexpr.

> vp <- v |> sprintf(fmt='\\b%s\\b') |> setNames(v) |> print()
        auch          ein 
"\\bauch\\b"  "\\bein\\b" 
> lapply(vp, gregexpr, text=tolower(string)) |> unlist(recursive=FALSE) |> lengths()
auch  ein 
   1    2 

The |> print() is just for simultaneously assigning and printing and can be removed.


Data:

string <- "Auch ein blindes Huhn findet einmal ein Korn."
v <- c("auch", "ein")
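
Note that gregexpr returns -1 when a pattern does not match, so taking lengths of its result reports 1 for a word that is absent. A hedged base R sketch that sidesteps this by going through regmatches, which returns character(0) for no match, and that also mimics preg_match_all with a single alternation pattern ("katze" is a hypothetical extra word added only to show the zero case):

```r
string <- "Auch ein blindes Huhn findet einmal ein Korn."
v <- c("auch", "ein", "katze")  # "katze" is hypothetical and does not occur

# One alternation pattern, matched case-insensitively, similar to preg_match_all
pattern <- sprintf("\\b(%s)\\b", paste(v, collapse = "|"))
matches <- regmatches(
  string,
  gregexpr(pattern, string, ignore.case = TRUE, perl = TRUE)
)[[1]]
matches          # all matched substrings
length(matches)  # total number of matches

# Per-word counts: length of character(0) is 0, so absent words count as 0
count_one <- function(p) {
  length(regmatches(string, gregexpr(p, string, ignore.case = TRUE, perl = TRUE))[[1]])
}
per_word <- setNames(vapply(sprintf("\\b%s\\b", v), count_one, integer(1)), v)
per_word
```

vapply is still an implicit loop, just like the lapply above; the single alternation pattern is the variant that avoids iterating over the words altogether.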

Upvotes: 3
