epr8n

Reputation: 45

How to compare a data frame to a list, and return values in the data frame matching the list?

Total newbie R question. I have a data frame df of ID and Notes:

ID    Notes
1     dogs are friendly
2     dogs and cats are pets
3     cows live on farms
4     cats and cows start with c

I have another list of values "animals"

cats
cows

I want to add another column, "Matches", to my data frame that contains all of the animals found in the Notes, e.g.:

ID    Notes                        Matches
1     dogs are friendly            
2     dogs and cats are pets       cats
3     cows live on farms           cows
4     cats and cows start with c   cats, cows

So far the only luck I've had is using grepl to return whether there are any matches:

grepl(paste(animals,collapse="|"),df$Notes,ignore.case = T)

How do I return the values instead?

Update
There are some rows in my data frame where I have multiple instances of cats in my Notes, for example:

ID    Notes                             Matches
1     dogs are friendly            
2     dogs and cats are pets            cats
3     cows live on farms                cows
4     cats and cats cows start with c   cats, cows

I only want to return one instance of the match. @LachlanO gets me very close with his solution, but I get:

[1] "NA, NA"                      "cats, NA"                    "NA, cows"                    "c(\"cats\", \"cats\"), cows"

How can I return only distinct matches?

Upvotes: 1

Views: 181

Answers (3)

Onyambu

Reputation: 79238

You can use gsub to obtain all the animals at once:

gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T)
[1] ""          "cats "     "cows"      "cats cows"

Thus, to write it in one line:

transform(df,matches=gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T))
  ID                       Notes   matches
1  1           dogs are friendly          
2  2      dogs and cats are pets     cats 
3  3          cows live on farms      cows
4  4 cats and cows start with c  cats cows

Upvotes: 0

LachlanO

Reputation: 1162

EDIT: Added a unique operation to deal with duplicate matches.

I can start you off, then point you in a direction :)

The code below uses stringr::str_extract_all to extract the relevant bits we need, but it unfortunately leaves us with bits we don't, most notably empty results where nothing matched. The unique function in the middle of our custom function just makes sure we take the unique matches element by element.

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

animals = c("cats", "cows")

matches <- as.data.frame(sapply(animals, function(x){sapply(stringr::str_extract_all(df$Notes, x), unique)}, simplify = TRUE))
matches[matches == "character(0)"] <- NA

apply(matches, 1, paste, collapse = ", ")
[1] "NA, NA"     "cats, NA"   "NA, cows"   "cats, cows"

You could set this as your extra column, except it's no good because of those NAs. If there were a paste function that ignored NAs, we'd be set.

Luckily another user has already solved this problem :) Check out this answer here.

That in combination with the above should give you a suitable solution!
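
As a rough illustration of what an NA-skipping paste could look like, here is a minimal sketch. It is not the code from the linked answer: the helper name paste_skip_na is made up for this example, and the matrix m is a hand-written stand-in shaped like the matches table above (one row per note, one column per animal, NA where nothing matched).

```r
# Illustrative stand-in for `matches`: filled column by column,
# first the "cats" column, then the "cows" column.
m <- matrix(
  c(NA, "cats", NA, "cats",
    NA, NA, "cows", "cows"),
  ncol = 2,
  dimnames = list(NULL, c("cats", "cows"))
)

# Hypothetical helper: drop the NAs, then collapse whatever is left.
# An all-NA row collapses to an empty string.
paste_skip_na <- function(x, sep = ", ") {
  paste(x[!is.na(x)], collapse = sep)
}

result <- apply(m, 1, paste_skip_na)
result
# [1] ""           "cats"       "cows"       "cats, cows"
```

The result could then be assigned as the extra column, e.g. df$Matches <- result, assuming the rows of the matrix are in the same order as the rows of df.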

Upvotes: 1

Gregor Thomas

Reputation: 145785

Here's how I would do it:

animals = c("cats", "cows")
reg = paste(animals, collapse = "|")

library(stringr)
matches = str_extract_all(Notes, reg)
matches = lapply(matches, unique)
matches = sapply(matches, paste, collapse = ",")

df$matches = matches
df
#   ID                       Notes   matches
# 1  1           dogs are friendly          
# 2  2      dogs and cats are pets      cats
# 3  3          cows live on farms      cows
# 4  4 cats and cows start with c  cats,cows

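If you want to fancy it up, paste word boundaries onto the regex, like reg = paste0("\\b", animals, "\\b", collapse = "|"), to avoid extracting the middle of words. (Use paste0 rather than paste here: paste's default separator would put spaces between the \\b and the animal name, and the pattern would then require literal spaces.)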
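
To see what the word boundaries change, here is a small sketch in base R. The notes vector and the extract_all helper are invented for this illustration (the helper just wraps regmatches and gregexpr so the example runs without stringr); the pattern is built with paste0 so that no spaces end up inside it.

```r
animals <- c("cats", "cows")

# Made-up notes: the first contains "cats" only inside another word.
notes <- c("wildcats roam", "cats and cows")

# Pattern without and with word boundaries.
reg_plain <- paste(animals, collapse = "|")                 # "cats|cows"
reg_bound <- paste0("\\b", animals, "\\b", collapse = "|")  # "\\bcats\\b|\\bcows\\b"

# Hypothetical helper: return all matches per element as a list.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

plain <- extract_all(notes, reg_plain)  # first note yields "cats" from "wildcats"
bound <- extract_all(notes, reg_bound)  # first note yields no match
```

With the bounded pattern, only the second note produces matches ("cats" and "cows"), which is usually what you want when the animal names can appear inside longer words.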


Using the data nicely provided by LachlanO:

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

Upvotes: 0
