Reputation: 45
Total newbie R question. I have a data frame df of ID and Notes:
ID Notes
1 dogs are friendly
2 dogs and cats are pets
3 cows live on farms
4 cats and cows start with c
I have another list of values, "animals":
cats
cows
I want to add another column "Matches" to my data frame that contains all of the animals found in the Notes, e.g.:
ID Notes Matches
1 dogs are friendly
2 dogs and cats are pets cats
3 cows live on farms cows
4 cats and cows start with c cats, cows
So far the only luck I've had is using grepl to return if there are any matches:
grepl(paste(animals,collapse="|"),df$Notes,ignore.case = T)
How do I return the values instead?
Update
There are some rows in my data frame where I have multiple instances of cats in my Notes, for example:
ID Notes Matches
1 dogs are friendly
2 dogs and cats are pets cats
3 cows live on farms cows
4 cats and cats cows start with c cats, cows
I only want to return one instance of the match. @LachlanO gets me very close with his solution, but I get:
[1] "NA, NA" "cats, NA" "NA, cows" "c(\"cats\", \"cats\"), cows"
How can I return only distinct matches?
Upvotes: 1
Views: 181
Reputation: 79238
You can use gsub to obtain all the animals at once:
gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T)
[1] "" "cats " "cows" "cats cows"
Thus, to write it in one line:
transform(df,matches=gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T))
ID Notes matches
1 1 dogs are friendly
2 2 dogs and cats are pets cats
3 3 cows live on farms cows
4 4 cats and cows start with c cats cows
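The gsub call above keeps every occurrence, so a row that mentions cats twice would list it twice. As a rough sketch of one way around that in base R (not part of this answer, and using sample data modelled on the question's update), regmatches and gregexpr can pull out the matches per row and unique can drop the repeats:

```r
# Sample data modelled on the question's update (row 4 repeats "cats")
df <- data.frame(
  ID = 1:4,
  Notes = c("dogs are friendly",
            "dogs and cats are pets",
            "cows live on farms",
            "cats and cats cows start with c"),
  stringsAsFactors = FALSE
)
animals <- c("cats", "cows")

# One alternation pattern: "cats|cows"
pattern <- paste(animals, collapse = "|")

# gregexpr() locates every match; regmatches() extracts them as a list
# with one character vector per row (character(0) when nothing matches)
found <- regmatches(df$Notes, gregexpr(pattern, df$Notes, ignore.case = TRUE))

# Lower-case so "Cats" and "cats" count as the same animal, drop repeats
# within a row, then collapse each row to a single string
df$Matches <- vapply(found,
                     function(m) paste(unique(tolower(m)), collapse = ", "),
                     character(1))
df$Matches
# [1] ""           "cats"       "cows"       "cats, cows"
```

Rows without a match come back as an empty string rather than NA, which keeps the column a plain character vector.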
Upvotes: 0
Reputation: 1162
EDIT: Added a unique operation to deal with duplicate matches.
I can start you off, then point you in a direction :)
The code below uses stringr::str_extract_all to extract the relevant bits we need, but it unfortunately leaves us with bits we don't want, most notably empty results for rows where an animal doesn't appear. The unique function in the middle of our custom function just makes sure we take the unique matches element by element.
ID = seq(1,4)
Notes <- c(
"dogs are friendly",
"dogs and cats are pets",
"cows live on farms",
"cats and cows start with c "
)
df <- data.frame(ID, Notes)
animals = c("cats", "cows")
matches <- as.data.frame(sapply(animals, function(x){sapply(stringr::str_extract_all(df$Notes, x), unique)}, simplify = TRUE))
matches[matches == "character(0)"] <- NA
apply(matches, 1, paste, collapse = ", ")
[1] "NA, NA" "cats, NA" "NA, cows" "cats, cows"
You could set this as your extra column, except it's no good because of those NAs. If there was a paste function which ignored NAs we'd be set.
Luckily another user has already solved this problem :) Check out this answer here.
That in combination with the above should give you a suitable solution!
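The linked answer is not reproduced here, so the following is only a sketch of the general idea: drop the NAs from each row before pasting. The helper name paste_na_rm is made up for illustration, and the matches data frame is typed in by hand to mirror the NA-filled result above.

```r
# Hand-written stand-in for the `matches` data frame produced above
matches <- data.frame(
  cats = c(NA, "cats", NA, "cats"),
  cows = c(NA, NA, "cows", "cows"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste the non-NA elements of x together
paste_na_rm <- function(x, sep = ", ") {
  paste(x[!is.na(x)], collapse = sep)
}

# Apply row by row; rows with no matches collapse to an empty string
result <- unname(apply(matches, 1, paste_na_rm))
result
# [1] ""           "cats"       "cows"       "cats, cows"
```

The result can then be assigned to the data frame as the extra column, e.g. df$Matches <- result.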
Upvotes: 1
Reputation: 145785
Here's how I would do it:
animals = c("cats", "cows")
reg = paste(animals, collapse = "|")
library(stringr)
matches = str_extract_all(Notes, reg)
matches = lapply(matches, unique)
matches = sapply(matches, paste, collapse = ",")
df$matches = matches
df
# ID Notes matches
# 1 1 dogs are friendly
# 2 2 dogs and cats are pets cats
# 3 3 cows live on farms cows
# 4 4 cats and cows start with c cats,cows
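If you want to fancy it up, paste word boundaries onto the regex, like reg = paste0("\\b", animals, "\\b", collapse = "|"), to avoid extracting the middle of words. Use paste0 rather than paste here: paste would insert a space between "\\b" and each animal, and the pattern would no longer match.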
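To see what the word boundaries change, here is a small sketch comparing the plain and the bounded pattern. It uses base R's regmatches and gregexpr instead of stringr so that it runs without extra packages; the notes vector and the helper name extract_animals are made up for the illustration.

```r
animals <- c("cats", "cows")

# Made-up notes: "cowshed" contains "cows" in the middle of a longer word
notes <- c("the cowshed is empty", "cows live on farms")

plain   <- paste(animals, collapse = "|")                  # "cats|cows"
bounded <- paste0("\\b", animals, "\\b", collapse = "|")   # "\\bcats\\b|\\bcows\\b"

# Hypothetical helper: extract unique matches per element, comma-separated
extract_animals <- function(pattern, x) {
  found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(found, function(m) paste(unique(m), collapse = ","), character(1))
}

extract_animals(plain, notes)
# [1] "cows" "cows"
extract_animals(bounded, notes)
# [1] ""     "cows"
```

With the plain pattern, "cowshed" is reported as a cows match; with the boundaries it is not.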
Using the data nicely provided by LachlanO:
ID = seq(1,4)
Notes <- c(
"dogs are friendly",
"dogs and cats are pets",
"cows live on farms",
"cats and cows start with c "
)
df <- data.frame(ID, Notes)
Upvotes: 0