Reputation: 1114
I'm new to R. I am looking to create a function that counts the number of rows whose text column contains one or more of the following words ("foo", "x", "y").
I then want to label each matching row with a flag, such as 1, in a new column.
I have a data frame that looks like this: a->
id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"
The correct output should be:
count: 3
new data frame a2 ->
id text time username keywordtag
1 "hello x" 10 "me" 1
2 "foo and y" 5 "you" 1
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know" 1
Any hints on how to do this would be appreciated!
Upvotes: 0
Views: 1896
Reputation: 1
A variation on Tyler Rinker's answer (note that the pattern must be matched against the text column, not keywordtag):
within(a, {keywordtag = as.numeric(grepl("foo|x|y", text, fixed = FALSE))})
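One caveat with an unanchored alternation such as "foo|x|y" is that it matches substrings, so a word like "exciting" would be flagged because it contains "x". A minimal sketch of the difference, using a made-up `texts` vector (not the asker's data) and a word-boundary version of the pattern:

```r
# Hypothetical example strings; "exciting" contains "x" only as a substring
texts <- c("hello x", "exciting", "x,y,foo")

# Unanchored alternation: matches anywhere inside a word
substring_hit <- grepl("foo|x|y", texts)

# Word boundaries (\\b) restrict the match to whole words
whole_word_hit <- grepl("\\b(foo|x|y)\\b", texts)

data.frame(texts, substring_hit, whole_word_hit)
```

Which behaviour is right depends on whether partial matches should count; the question's sample data does not distinguish the two.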
Upvotes: 0
Reputation: 99371
This is probably much safer than my previous answer.
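string <- c("foo", "x", "y")
# unlist() flattens the matching row indices even when each word matches a different number of rows
a$keywordtag <- (seq_len(nrow(a)) %in% unlist(lapply(string, grep, a$text, fixed = TRUE))) + 0
a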
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 nothing 15 everyone 0
# 4 4 x,y,foo 0 know 1
Upvotes: 1
Reputation: 110024
Here are 2 approaches with base and qdap:
a <- read.table(text='id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"', header=TRUE)
# Base (group the alternatives; [foo] would be a character class matching a single "f" or "o")
a$keywordtag <- as.numeric(grepl("\\b(foo|x|y)\\b", a$text))
a
# qdap
library(qdap)
terms <- termco(gsub("(,)([^ ])", "\\1 \\2", a$text),
id(a), list(terms = c(" foo ", " x ", " y ")))
a$keywordtag <- as.numeric(counts(terms)[[3]] > 0)
a
# output
## id text time username keywordtag
## 1 1 hello x 10 me 1
## 2 2 foo and y 5 you 1
## 3 3 nothing 15 everyone 0
## 4 4 x,y,foo 0 know 1
The base approach is by far the more elegant and simple of the two.
# EDIT (borrowing from Richard, I believe this is the most generalizable and understandable):
words <- c("foo", "x", "y")
# wrap each word in word boundaries, then join the alternatives with "|"
regex <- paste(sprintf("\\b%s\\b", words), collapse="|")
within(a,{
keywordtag = as.numeric(grepl(regex, a$text))
})
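Since the question asks for a function that returns both the count and the tagged data frame, the regex approach can be wrapped up roughly as follows. This is only a sketch: `tag_keywords` and its `column` argument are names made up for illustration, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: tags rows whose text contains any of `words` as a whole word
tag_keywords <- function(df, words, column = "text") {
  # Build \\b(foo|x|y)\\b from the word list (assumes no regex metacharacters)
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  df$keywordtag <- as.numeric(grepl(pattern, df[[column]]))
  list(count = sum(df$keywordtag), data = df)
}

# The data frame from the question
a <- data.frame(
  id = 1:4,
  text = c("hello x", "foo and y", "nothing", "x,y,foo"),
  time = c(10, 5, 15, 0),
  username = c("me", "you", "everyone", "know"),
  stringsAsFactors = FALSE
)

res <- tag_keywords(a, c("foo", "x", "y"))
res$count  # number of tagged rows
res$data   # original columns plus keywordtag
```

Summing the 0/1 column gives the count directly, so no separate pass over the data is needed.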
Upvotes: 2
Reputation: 44340
Your question boils down to splitting a vector of strings on multiple delimiters and checking if any of the tokens are in your set of desired words. You can split on multiple delimiters using strsplit (I'll use comma and whitespace, since your question doesn't specify the full set of delimiters for your problem), and I'll use intersect to check if it contains any word in your set:
m <- c("foo", "x", "y")
a$keywordtag <- as.numeric(unlist(lapply(strsplit(as.character(a$text), ",|\\s"),
function(x) length(intersect(x, m)) > 0)))
a
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 exciting 15 everyone 0
# 4 4 x,y,foo 0 know 1
I've included "exciting" in row 3: the word contains the letter "x", but this approach does not flag it because it compares whole tokens rather than substrings.
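If the full set of delimiters is unknown, one option is to split on any run of whitespace or punctuation instead of just commas and spaces. A minimal sketch under that assumption; `has_keyword` and the sample `texts` are hypothetical names and data, not part of the original problem:

```r
# Hypothetical helper: TRUE where any whole token of `text` is in `words`
has_keyword <- function(text, words) {
  # Split on runs of whitespace or punctuation (an assumed delimiter set)
  tokens <- strsplit(as.character(text), "[[:space:][:punct:]]+")
  # vapply guarantees exactly one logical value per input string
  vapply(tokens, function(tok) any(tok %in% words), logical(1))
}

texts <- c("hello x", "foo; and y!", "exciting", "x,y,foo")
hits <- has_keyword(texts, c("foo", "x", "y"))
hits       # per-row flags
sum(hits)  # number of matching rows
```

Using `vapply` with `logical(1)` instead of `unlist(lapply(...))` makes the result type explicit, which can help catch surprises on unusual input.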
Upvotes: 1