Reputation: 4648
I have a data frame with 2 columns and 20 rows: the title of an article and the description of that article. I have a few keywords that I would like to match against these 2 columns. If a keyword matches, the result should be 1, otherwise 0. I tried a simple expression, (my.df == "human") + 0. However, it does not work as expected: it only finds cells that are exactly equal to the keyword, so it returns 0 even when the word "human" appears somewhere in the title. Any suggestions and help are appreciated.
Below is an example:
my.keyword<- c("human", "lung", "mutation", "chromosome")
# sample df. created from web
my.df
  Title                  Description
  Atlas of mutations in  Lung cancer is the leading cause of cancer
                         related mortality in the United States, with
                         an estimated 221,200 new cases and 158,040
                         deaths anticipated in 2015 (ACS 2015).
The complexity increases because I would like to search for every string in my.keyword without a for loop. The output should be the number of keywords (human, lung, mutation, chromosome) that match in the title: if all four match, the result should be 4; if only 3 of the 4 match, it should be 3. The same applies to the description. A keyword should count once per match, no matter how many times it is repeated.
Thank you
Upvotes: 1
Views: 792
Reputation: 1268
One way to do this is using grepl. Here's some sample data expanding upon yours:
# Create sample data
Title <- c("Atlas of mutations in",
"Monkey lungs",
"Flatulence and the art of chromosome mutation",
"No keywords here")
Description = c("Lung cancer is the leading cause of cancer
related mortality in the United States, with
an estimated 221,200 new cases and 158,040
deaths anticipated in 2015 (ACS 2015).",
"That was it, the monkeys had had enough
and began the ferocious flinging of feces
about the room as madness broke out and
everyone started their chromosome mutations. The monkey
kingdom would rise again",
"Once upon a time there was a human that
had trouble with R and sought out stack overflow
for help",
"Strange days and strange times for the human race")
# Join the wrapped description lines with a single space so words are not run together
my.df <- data.frame(Title = Title,
                    Description = gsub("\\s*\n\\s*", " ", Description))
Here's a method for extracting the presence of your keywords in Description:
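# For one keyword, test every description (case-insensitive); returns a logical vector
fun <- function(x) grepl(x, my.df$Description, ignore.case = TRUE)
# One 0/1 column per keyword, one row per article
keywordsDescrip <- as.data.frame(1 * sapply(my.keyword, fun))
# Number of distinct keywords found in each description
keywordsDescrip$sum <- rowSums(keywordsDescrip)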
And the output:
> keywordsDescrip
  human lung mutation chromosome sum
1     0    1        0          0   1
2     0    0        1          1   2
3     1    0        0          0   1
4     1    0        0          0   1
Just repeat the above process, swapping out my.df$Description for my.df$Title, to assess the appearance of your keywords in that field.
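If you would rather not repeat the code by hand, the same idea can be wrapped in a small helper and applied to both columns at once. This is only a sketch: count_keywords and the two-row data frame below are made up for illustration and are not part of the code above.

```r
my.keyword <- c("human", "lung", "mutation", "chromosome")

# Hypothetical two-row example, not the original data
my.df <- data.frame(
  Title = c("Atlas of mutations in", "Monkey lungs"),
  Description = c("Lung cancer is the leading cause", "The human chromosome"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct keywords found in each element of `text`
count_keywords <- function(text, keywords) {
  # One grepl() call per keyword; each call returns one logical per element of `text`
  hits <- sapply(keywords, grepl, x = text, ignore.case = TRUE)
  # Force a rows-by-keywords matrix so a single-row input also works
  rowSums(matrix(hits, nrow = length(text)))
}

# One column of counts per text column of the data frame
counts <- sapply(my.df[c("Title", "Description")], count_keywords,
                 keywords = my.keyword)
print(counts)
```

Because grepl only reports whether a pattern occurs at all, a keyword that is repeated within one cell still counts once.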
Upvotes: 3
Reputation: 59345
my.keyword<- c("human", "lung", "mutation", "chromosome")
txt <- "Human lung cancer due to chromosome mutations is the leading cause of cancer related mortality in the United States, with an estimated 221,200 new cases and 158,040 deaths anticipated in 2015 (ACS 2015). "
count.kw <- function(txt) sum(sapply(my.keyword, grepl, x=tolower(txt), fixed=TRUE))
count.kw(txt)
# [1] 4
Notice how I "edited" your text to include more than one of the keywords.
This works for 1 string, but not for a vector of strings: given a vector, sum() would pool the matches from all strings into a single number. So we have to vectorize the function:
vcount.lw <- Vectorize(count.kw)
Then, create an example:
set.seed(1)
rwords <- function(x) paste(paste(my.keyword[sample(1:4,sample(1:4))], collapse= " "),"blah, blah, blah")
df <- data.frame(Title=sapply(1:10,rwords))
and demonstrate the solution.
vcount.lw(df$Title)
# [1] 2 4 4 3 2 3 4 2 2 2
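To see why the vectorizing step matters, here is a small sketch comparing the plain function with its vectorized wrapper on a vector of strings. The titles vector and the names pooled, vcount and per_title are invented for this illustration.

```r
my.keyword <- c("human", "lung", "mutation", "chromosome")

# Same definition as above: counts the keywords found in `txt`
count.kw <- function(txt) sum(sapply(my.keyword, grepl, x = tolower(txt), fixed = TRUE))

# Hypothetical titles, chosen so the expected counts are easy to check by eye
titles <- c("Human lung", "Chromosome mutation in human lung", "nothing")

# Called on the whole vector, sum() adds up the matches of all three strings
pooled <- count.kw(titles)

# Vectorize() calls count.kw once per string, giving one count per title
vcount <- Vectorize(count.kw, USE.NAMES = FALSE)
per_title <- vcount(titles)

print(pooled)     # 6
print(per_title)  # 2 4 0
```

USE.NAMES = FALSE only drops the strings that Vectorize would otherwise attach as element names; the counts are the same either way.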
Upvotes: 1