Reputation: 109

How to merge data frames where column1 is substring of column2

I have a data frame df and would like to classify each row based on the value of the column df$name. For the classification I have a two-column data frame tl with the columns tl$name and tl$type. I would like to merge the two data frames on a "like" condition, i.e. wherever grepl(tl$name, df$name) is true, instead of on the exact match df$name == tl$name.

I have already tried looping over all rows in df and checking each one for a match in tl, but this seems very time-consuming.

E.g.:

  name        
# African elephant    
# Indian elephant    
# Silverback gorilla     
# Nile crocodile   
# White shark

  name        type
# elephant    mammal
# gorilla     mammal
# crocodile   reptile
# shark       fish
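
A minimal sketch of the grepl-style match described above, applied to the example data. The data.frame() construction is an assumption about how the two tables are stored (plain character columns), and the intermediate names hits and first_hit are only illustrative. It loops over the short lookup table rather than over every row of df, since grepl() is vectorised over the strings being searched but takes only one pattern at a time.

```r
# Example data rebuilt from the tables above (assumed to be character columns).
df <- data.frame(name = c("African elephant", "Indian elephant",
                          "Silverback gorilla", "Nile crocodile",
                          "White shark"),
                 stringsAsFactors = FALSE)
tl <- data.frame(name = c("elephant", "gorilla", "crocodile", "shark"),
                 type = c("mammal", "mammal", "reptile", "fish"),
                 stringsAsFactors = FALSE)

# For each tl$name, test which df$name values contain it.
# fixed = TRUE treats the pattern as a literal substring, not a regex.
# Result: logical matrix, one row per df row and one column per tl row
# (this assumes df has more than one row; otherwise sapply returns a vector).
hits <- sapply(tl$name, function(pattern) grepl(pattern, df$name, fixed = TRUE))

# Index of the first matching tl row for each df row; NA if nothing matches.
first_hit <- apply(hits, 1, function(row) which(row)[1])

# Look up the type through that index.
df$type <- tl$type[first_hit]
```

If several tl$name values can occur in the same df$name, this keeps only the first one in tl's row order.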

Upvotes: 3

Answers (3)

sid

Reputation: 11

df

  name        
# African elephant    
# Indian elephant    
# Silverback gorilla     
# Nile crocodile   
# White shark

tl

  name        type
# elephant    mammal
# gorilla     mammal
# crocodile   reptile
# shark       fish

I think this is what you want to do:

library(splitstackshape)
df <- cSplit(df, splitCols = "name", sep = " ")

The above command splits that column into two columns named name_1 and name_2.

colnames(df) <- c("qualifier", "name")

The above command renames the columns so that the second one, which holds the animal, matches tl$name for merging.

df_tl <- merge(x = df, y = tl, by = "name", all = TRUE)

The above code should give you the desired output.

Upvotes: 0

Steven Beaupré

Reputation: 21641

Another idea:

library(tidyverse)

df %>%
  separate(name, into = c("t", "name")) %>%
  left_join(tl, by = "name")

Which gives:

#           t      name    type
#1    African  elephant  mammal
#2     Indian  elephant  mammal
#3 Silverback   gorilla  mammal
#4       Nile crocodile reptile
#5      White     shark    fish

Upvotes: 1

akrun

Reputation: 887841

We can strip the leading word with sub() by matching one or more non-whitespace characters (\\S+) followed by one or more whitespace characters (\\s+) at the start (^) of the string, replacing the match with an empty string (""), and then merge the result with the second dataset (tl):

merge(transform(df, name = sub("^\\S+\\s+", "", name)), tl)
#      name    type
#1 crocodile reptile
#2  elephant  mammal
#3  elephant  mammal
#4   gorilla  mammal
#5     shark    fish

If we need to update the first dataset instead:

df$type <- with(df, tl$type[match(sub("^\\S+\\s+", "", name), tl$name)])
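
A small self-contained sketch of the match() lookup above, to show what happens when a name has no entry in tl. The data.frame() construction and the extra row "Giant squid" are assumptions added purely for illustration; they are not part of the original data.

```r
# Reduced example data; "Giant squid" is a hypothetical row with no match in tl.
df <- data.frame(name = c("African elephant", "Nile crocodile", "Giant squid"),
                 stringsAsFactors = FALSE)
tl <- data.frame(name = c("elephant", "gorilla", "crocodile", "shark"),
                 type = c("mammal", "mammal", "reptile", "fish"),
                 stringsAsFactors = FALSE)

# Drop the first word and the whitespace after it to get the lookup key.
key <- sub("^\\S+\\s+", "", df$name)

# match() returns the position of each key in tl$name, or NA if it is absent,
# so rows without a match get type NA instead of being dropped.
df$type <- tl$type[match(key, tl$name)]
```

Unlike merge() with its default arguments, this keeps df's row order and its unmatched rows.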

Upvotes: 0
