lmcshane

Reputation: 1114

R: count word occurrence by row and create variable

New to R. I am looking to create a function that counts the number of rows containing one or more of the following words ("foo", "x", "y") in a column.

I then want to label each matching row with a flag variable, such as 1.

I have a data frame, a, that looks like this:

 id     text        time   username
 1     "hello x"     10     "me"
 2     "foo and y"   5      "you"
 3     "nothing"     15     "everyone"
 4     "x,y,foo"     0      "know"

The correct output should be:

count: 3

New data frame a2:

id     text        time   username        keywordtag  
 1     "hello x"     10     "me"          1
 2     "foo and y"   5      "you"         1
 3     "nothing"     15     "everyone"     
 4     "x,y,foo"     0      "know"        1

Any hints on how to do this would be appreciated!

Upvotes: 0

Views: 1896

Answers (4)

oguz

Reputation: 1

A variation on Tyler Rinker's answer:

within(a,{keywordtag = as.numeric(grepl("foo|x|y", fixed = FALSE, a$text))})

Upvotes: 0

Rich Scriven

Reputation: 99371

This is probably much safer than my previous answer.

> string <- c("foo", "x", "y")
> a$keywordtag <- 
      (1:nrow(a) %in% unlist(lapply(string, grep, a$text, fixed = TRUE)))+0
> a
#   id      text time username keywordtag
# 1  1   hello x   10       me          1
# 2  2 foo and y    5      you          1
# 3  3   nothing   15 everyone          0
# 4  4   x,y,foo    0     know          1

Upvotes: 1

Tyler Rinker

Reputation: 110024

Here are 2 approaches with base and qdap:

a <- read.table(text='id     text        time   username
 1     "hello x"     10     "me"
 2     "foo and y"   5      "you"
 3     "nothing"     15     "everyone"
 4     "x,y,foo"     0      "know"', header=TRUE)

# Base

a$keywordtag <- as.numeric(grepl("\\b(foo|x|y)\\b", a$text))
a

# qdap

library(qdap)
terms <- termco(gsub("(,)([^ ])", "\\1 \\2", a$text), 
    id(a), list(terms = c(" foo ", " x ", " y ")))
a$keywordtag <- as.numeric(counts(terms)[[3]] > 0)
a

# output

##   id      text time username keywordtag
## 1  1   hello x   10       me          1
## 2  2 foo and y    5      you          1
## 3  3   nothing   15 everyone          0
## 4  4   x,y,foo    0     know          1

The base approach is by far more elegant and simple.

# EDIT (borrowing from Richard; I believe this is the most generalizable and understandable):

words <- c("foo", "x", "y")
regex <- paste(sprintf("\\b%s\\b", words), collapse="|")
within(a,{
    keywordtag = as.numeric(grepl(regex, a$text))
})
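
As a minimal sketch, the same word-boundary idea can be wrapped in a function that also gives the row count the question asks for. The name tag_keywords and its column argument are made up for illustration, and the sketch assumes the words contain no regex metacharacters:

```r
# Hypothetical helper: flag rows whose text contains any whole word in `words`.
# Assumes the words contain no regex metacharacters (they are pasted in as-is).
tag_keywords <- function(df, words, column = "text") {
  # \\b anchors the alternation at word boundaries, so "x" will not match "exciting".
  regex <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  df$keywordtag <- as.numeric(grepl(regex, df[[column]]))
  df
}

# Same data as in the question, built directly instead of via read.table.
a <- data.frame(
  id = 1:4,
  text = c("hello x", "foo and y", "nothing", "x,y,foo"),
  time = c(10, 5, 15, 0),
  username = c("me", "you", "everyone", "know"),
  stringsAsFactors = FALSE
)

a2 <- tag_keywords(a, c("foo", "x", "y"))

# Number of rows containing at least one keyword.
n_tagged <- sum(a2$keywordtag)
```

Summing the 0/1 column gives the count, so there is no need to track it separately.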

Upvotes: 2

josliber

Reputation: 44340

Your question boils down to splitting a vector of strings on multiple delimiters and checking if any of the tokens are in your set of desired words. You can split on multiple delimiters using strsplit (I'll use comma and whitespace, since your question doesn't specify the full set of delimiters for your problem), and I'll use intersect to check if it contains any word in your set:

m <- c("foo", "x", "y")
a$keywordtag <- as.numeric(unlist(lapply(strsplit(as.character(a$text), ",|\\s"),
                                         function(x) length(intersect(x, m)) > 0)))
a
#   id      text time username keywordtag
# 1  1   hello x   10       me          1
# 2  2 foo and y    5      you          1
# 3  3  exciting   15 everyone          0
# 4  4   x,y,foo    0     know          1

I've included "exciting", which is a word that contains "x" but that isn't listed as a match by this approach.
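
To make that difference concrete, here is a rough side-by-side sketch. The text vector is a made-up stand-in for a$text with "exciting" in the third position, and vapply is used here only as a stricter alternative to unlist(lapply(...)):

```r
m <- c("foo", "x", "y")
text <- c("hello x", "foo and y", "exciting", "x,y,foo")

# Plain alternation matches substrings, so "exciting" is flagged
# simply because it contains the letter "x".
substring_hits <- grepl("foo|x|y", text)

# Token-based check: split on commas or whitespace, then test
# whether any resulting token is one of the desired words.
tokens <- strsplit(text, ",|\\s")
token_hits <- vapply(tokens, function(tok) any(tok %in% m), logical(1))

# Compare the two approaches row by row.
data.frame(text, substring_hits, token_hits)
```

Only the token-based column leaves the "exciting" row unflagged.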

Upvotes: 1
