Wiam Nasr

Reputation: 43

String matching by contents of columns

I have a huge data frame and several keywords (all non-numeric). I want to write R code that goes over the whole data frame and finds the columns whose text contains one or more of these keywords. When there is a match, the matching keywords should be put in a new column in the same row (if more than one keyword matches, either both in that column separated by a comma, or in an additional column).

For instance, given the data below, I want to add a column that shows which of these keywords match:

keywords <- c("Smith", "Carla")

Then I want the results to look something like this:

**Names**          **Matching**

John Smith          Smith
Carla Smith         Carla, Smith **(could be same column or different column)**
Smith Smith         Smith
John Carla          Carla

I tried using grepl:

 Matching <- Data[grepl("carla",Data$Names), ]

Can you please help me?

Upvotes: 2

Views: 71

Answers (1)

Richard

Reputation: 1294

This answer has two parts: OP edited the question, but the first part still seems useful.

Part 1: OP's original question

It usually helps to break your task down into smaller ones and to provide a minimal example.

So here's some data:

shoes <- c("cookie", "nike", "adidas")
drinks <- c("water", "lemon", "cookie")
clothes <- c("pants", "cookie", "sweater")
df <- data.frame(shoes, drinks, clothes, stringsAsFactors = FALSE)
df

Now let's go with @akrun's comment and just try and see whether we can get the string "cookie" from a single column:

library(stringr)
# str_detect(string, pattern): TRUE where the column value contains "cookie"
str_detect(df$shoes, "cookie")

So that works. Now we need to do it for all columns. To help us on the way, we write a small function and apply it to each column:

extract_cookie <- function(x) {
    x <- as.character(x) # just to safeguard against non-string values
    str_detect(x, "cookie") # TRUE where the value contains "cookie"
}
sapply(df, extract_cookie)
     shoes drinks clothes
[1,]  TRUE  FALSE   FALSE
[2,] FALSE  FALSE    TRUE
[3,] FALSE   TRUE   FALSE
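
The same column-wise loop can be extended from one hard-coded string to several keywords. What follows is only a minimal sketch in base R, reusing the toy data from above; the keywords vector and the helper name has_keyword are assumptions made for illustration, not part of the original question.

```r
# Toy data from above, repeated so the snippet runs on its own
shoes <- c("cookie", "nike", "adidas")
drinks <- c("water", "lemon", "cookie")
clothes <- c("pants", "cookie", "sweater")
df <- data.frame(shoes, drinks, clothes, stringsAsFactors = FALSE)

# Hypothetical keyword list, chosen only for this example
keywords <- c("nike", "lemon")

# Hypothetical helper: TRUE for every value that contains at least one keyword
has_keyword <- function(x, keywords) {
    x <- as.character(x)
    # one logical vector per keyword, combined element-wise with OR
    per_keyword <- lapply(keywords, function(k) grepl(k, x, fixed = TRUE))
    Reduce(`|`, per_keyword)
}

# Logical matrix: one row per data row, one column per data frame column
hits <- sapply(df, has_keyword, keywords = keywords)

# Names of the columns that contain at least one keyword somewhere
cols_with_match <- colnames(hits)[colSums(hits) > 0]
cols_with_match
```

fixed = TRUE treats each keyword as a literal string rather than a regular expression, which seems the safer default when the keywords are plain words.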

Part 2: after OP edited the question

Since you now mention your own efforts using grepl, here is an approach along those lines:

people <- c("John Smith", "Carla Smith", "Smith Smith", "John Carla")
persons <- data.frame(people, stringsAsFactors = FALSE)

# flag the rows that contain each keyword
persons$smiths <- grepl("Smith", persons$people)
persons$carlas <- grepl("Carla", persons$people)
# TRUE only when both keywords occur in the same row
persons$perfectMatch <- persons$smiths & persons$carlas

# the same idea, but storing labels instead of TRUE/FALSE
persons$smiths2 <- ifelse(persons$smiths, "Smiths", "")
persons$carlas2 <- ifelse(persons$carlas, "Carla", "")
persons$perfectMatch2 <- ifelse(persons$perfectMatch,
                                paste(persons$carlas2, persons$smiths2), "")
persons

       people smiths carlas perfectMatch smiths2 carlas2 perfectMatch2
1  John Smith   TRUE  FALSE        FALSE  Smiths                      
2 Carla Smith   TRUE   TRUE         TRUE  Smiths   Carla  Carla Smiths
3 Smith Smith   TRUE  FALSE        FALSE  Smiths                      
4  John Carla  FALSE   TRUE        FALSE           Carla              
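
Writing one column per keyword gets tedious once there are more than a handful of keywords. As a rough sketch of how this could be generalised to the keywords vector from the question, the following builds a single comma-separated Matching column in base R. The helper name match_keywords is hypothetical, and the use of fixed = TRUE (literal rather than regex matching) is an assumption about what is wanted.

```r
people <- c("John Smith", "Carla Smith", "Smith Smith", "John Carla")
persons <- data.frame(people, stringsAsFactors = FALSE)
keywords <- c("Smith", "Carla")

# Hypothetical helper: for one string, return the keywords found in it,
# joined by ", " (an empty string when nothing matches)
match_keywords <- function(text, keywords) {
    found <- vapply(keywords,
                    function(k) grepl(k, text, fixed = TRUE),
                    logical(1))
    paste(keywords[found], collapse = ", ")
}

# Apply the helper to every row of the people column
persons$Matching <- vapply(persons$people, match_keywords, character(1),
                           keywords = keywords, USE.NAMES = FALSE)
persons
```

The keywords appear in the order they have in the keywords vector, so the second row reads "Smith, Carla" rather than "Carla, Smith"; sorting keywords beforehand would give alphabetical order. Note also that grepl is case-sensitive by default, so a lowercase pattern such as "carla" does not match "Carla". If case should be ignored, one option is to drop fixed = TRUE and pass ignore.case = TRUE instead.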

Upvotes: 1
