Reputation: 8247

how to match strings of vectors with dataframe column in R

I have following dataframe in r

Id    titles
1     emami paper mills slips 10% on dismal q4 numbers
2     jsw steel q4fy17 standalone net profit rises 173.33%
3     fmcg major hul q4fy17 standalone net profit rises 6.2
4     chennai petroleum, allsec tech slip 6-7% on poor q4

And, I have names in vectors

names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")

I want to match the dataframe column titles with strings of vectors and print corresponding string in a new column. My desired dataframe is

 Id    titles                                                    names
1     emami paper mills slips 10% on dismal q4 numbers           emami ltd
2     jsw steel q4fy17 standalone net profit rises 173.33%       jsw steel ltd
3     fmcg major hul q4fy17 standalone net profit rises 6.2      hul india ltd
4     chennai petroleum, allsec tech slip 6-7% on poor q4        chennai petroleum corp ltd

I am doing it with following code,but it does not give me what I want.

df[grepl(paste(names, collapse="|"), df$titles),]

How to do it in R?

Upvotes: 3

Answers (4)

PKumar

Reputation: 11128

If I understood you correctly then , you can use BaseR's gregexpr along with regematches and gsub to complete your task.

Data: EDIT After OP has changed the question

options(stringsAsFactors = F)
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
                            "jsw steel q4fy17 standalone net profit rises 173.33%",
                            "fmcg major hul q4fy17 standalone net profit rises 6.2",
                            "chennai petroleum, allsec tech slip 6-7% on poor q4"),stringsAsFactors = F)

names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")

Regex:

library(dplyr)
library(stringr)

newnames <- gsub("^(\\w+).*","\\1",names)
regmat <- regmatches(df$titles,gregexpr(paste0(newnames,collapse="|"),df$titles))
regmat[lapply(regmat,length) == 0] <- NA
df <- data.frame(cbind(df,newnames =do.call("rbind",regmat)),stringsAsFactors = F)
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")

You can also use stringr library like below:

library(stringr)
newnames <- str_replace(names,"^(\\w+).*","\\1")
df$newnames <- str_extract(df$titles,paste0(newnames,collapse="|"))
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")

output:

    > left_join(df,df1,by="newnames")
                                                 titles newnames                      names
1      emami paper mills slips 10% on dismal q4 numbers    emami                  emami ltd
2  jsw steel q4fy17 standalone net profit rises 173.33%      jsw              jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2      hul              hul india ltd
4   chennai petroleum, allsec tech slip 6-7% on poor q4  chennai chennai petroleum corp ltd

Upvotes: 2

Nico Coallier

Reputation: 686

To add to the previous answer I have made a function including some of the previous comments :

df <-  data.frame(title=c("emami paper mills slips 10% on dismal q4 numbers",
                            "jsw steel q4fy17 standalone net profit rises 173.33%",
                            "fmcg major hul q4fy17 standalone net profit rises 6.2"))


names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs")

find_string <- function(data,names){

    ### Clean the names 
    newnames <- gsub("^(\\w+).*","\\1",names)

    ### Loop over the names to find which sentence contain it
    for(i in 1:length(newnames)){

        if(length(grep(newnames[i],df$title)) != 0){
            df$names[grep(newnames[i],df$title)] <- newnames[i]

        }else{
            print(paste(names[i],"not found in the data!"))
        }
    }
    return(df)
}

### Run the function

find_string(df,names)

Hope this helps !

Upvotes: 0

Mike H.

Reputation: 14370

It's also possible to use sqldf for this type of "fuzzy" merge.

Construct lookup:

names <- data.frame(name = c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs"))
names$lookup <- gsub("(\\w+).*", "\\1", names$name)

Perform merge:

library(sqldf)
res <- sqldf("SELECT l.*, r.name
       FROM df as l
       LEFT JOIN names as r
       ON l.titles LIKE '%'||r.lookup||'%'")

A few notes: I extract the first word from the lookup since you said you want "hul" only, and not "hul india". Also in sql the || means concatenate and % means wildcard (which will match anything), so this will match if any of the lookups appear anywhere in the text no matter what is before or after it.

Another option using Reduce and then merging would be:

df$lookup <- Reduce( function(x, y) {x[grepl(y,x)] <- y; x}, c(list(df$titles), names$lookup))
merge(df, names)

Upvotes: 0

cmbarbu

Reputation: 4532

Remove the ltd from your names :

names <- gsub(" ltd","",names)

Upvotes: 0

how to match strings of vectors with dataframe column in R

Answers (4)

Related Questions