Reputation: 8247
I have following dataframe in r
Id titles
1 emami paper mills slips 10% on dismal q4 numbers
2 jsw steel q4fy17 standalone net profit rises 173.33%
3 fmcg major hul q4fy17 standalone net profit rises 6.2
4 chennai petroleum, allsec tech slip 6-7% on poor q4
And, I have names in vectors
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
I want to match the dataframe column titles with strings of vectors and print corresponding string in a new column. My desired dataframe is
Id titles names
1 emami paper mills slips 10% on dismal q4 numbers emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai petroleum corp ltd
I am doing it with following code,but it does not give me what I want.
df[grepl(paste(names, collapse="|"), df$titles),]
How to do it in R?
Upvotes: 3
Views: 246
Reputation: 11128
If I understood you correctly then , you can use BaseR's gregexpr
along with regematches
and gsub
to complete your task.
Data: EDIT After OP has changed the question
options(stringsAsFactors = F)
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
"jsw steel q4fy17 standalone net profit rises 173.33%",
"fmcg major hul q4fy17 standalone net profit rises 6.2",
"chennai petroleum, allsec tech slip 6-7% on poor q4"),stringsAsFactors = F)
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
Regex:
library(dplyr)
library(stringr)
newnames <- gsub("^(\\w+).*","\\1",names)
regmat <- regmatches(df$titles,gregexpr(paste0(newnames,collapse="|"),df$titles))
regmat[lapply(regmat,length) == 0] <- NA
df <- data.frame(cbind(df,newnames =do.call("rbind",regmat)),stringsAsFactors = F)
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
You can also use stringr
library like below:
library(stringr)
newnames <- str_replace(names,"^(\\w+).*","\\1")
df$newnames <- str_extract(df$titles,paste0(newnames,collapse="|"))
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
output:
> left_join(df,df1,by="newnames")
titles newnames names
1 emami paper mills slips 10% on dismal q4 numbers emami emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai chennai petroleum corp ltd
Upvotes: 2
Reputation: 686
To add to the previous answer I have made a function including some of the previous comments :
df <- data.frame(title=c("emami paper mills slips 10% on dismal q4 numbers",
"jsw steel q4fy17 standalone net profit rises 173.33%",
"fmcg major hul q4fy17 standalone net profit rises 6.2"))
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs")
find_string <- function(data,names){
### Clean the names
newnames <- gsub("^(\\w+).*","\\1",names)
### Loop over the names to find which sentence contain it
for(i in 1:length(newnames)){
if(length(grep(newnames[i],df$title)) != 0){
df$names[grep(newnames[i],df$title)] <- newnames[i]
}else{
print(paste(names[i],"not found in the data!"))
}
}
return(df)
}
### Run the function
find_string(df,names)
Hope this helps !
Upvotes: 0
Reputation: 14360
It's also possible to use sqldf
for this type of "fuzzy" merge.
Construct lookup:
names <- data.frame(name = c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs"))
names$lookup <- gsub("(\\w+).*", "\\1", names$name)
Perform merge:
library(sqldf)
res <- sqldf("SELECT l.*, r.name
FROM df as l
LEFT JOIN names as r
ON l.titles LIKE '%'||r.lookup||'%'")
A few notes: I extract the first word from the lookup since you said you want "hul"
only, and not "hul india"
. Also in sql
the ||
means concatenate and %
means wildcard (which will match anything), so this will match if any of the lookups appear anywhere in the text no matter what is before or after it.
Another option using Reduce
and then merging would be:
df$lookup <- Reduce( function(x, y) {x[grepl(y,x)] <- y; x}, c(list(df$titles), names$lookup))
merge(df, names)
Upvotes: 0