Liu

Reputation: 31

How do I match words in each row from a column to a data frame with words, in R?

I have a CSV file with a column named Phrases, shown below. I would like to assign an object to each phrase, based on the object descriptions in a data frame.

      [,1]
[1,] Phrases
[2,] sugar fluid
[3,] they are crispy
[4,] its soft

I have a data frame with the following:

      [,1]                  [,2] 
[1,] Description,           Object
[2,] sweet and delicious,   apple
[3,] hard,                  nuts
[4,] wet and fluid,         water
[5,] sugar fluid,           coke
[6,] soft,                  marshmallow
[7,] crispy salty,          chips

The output should look like this

       [,1]             [,2] 
[1,] Phrases,         Object Assigned
[2,] sugar fluid,     coke
[3,] they are crispy, chips
[4,] its soft,        marshmallow

Note that a phrase may not match a description exactly. The object whose description shares the most words with the phrase should be assigned to it.

How do I do this?

Upvotes: 0

Views: 67

Answers (1)

Dan

Reputation: 12084

Here's a rough solution. First, I create the data frames. (For future reference: it helps a great deal if you provide the data in a copy-and-pastable format, such as using dput.)

# Create data frames
df_object <- structure(list(Description = c("sweet and delicious", "hard", 
                                            "wet and fluid", "sugar fluid", "soft", "crispy salty"), 
                            Object = c("apple", "nuts", "water", "coke", "marshmallow", "chips")), 
                       row.names = c(NA, -6L), class = c("data.frame"), 
                       .Names = c("Description",  "Object"))

df_phrases <- structure(list(Phrases = c("sugar fluid", "they are crispy", "its soft")), 
                        row.names = c(NA, -3L), class = c("data.frame"), 
                        .Names = "Phrases")
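As an aside on the dput suggestion: the structure(...) calls above look like the kind of text dput prints. A minimal sketch of producing and re-reading such output, assuming the phrases are already in a data frame (the name rebuilt is only illustrative):

```r
# Build the data frame the ordinary way
df_phrases <- data.frame(
  Phrases = c("sugar fluid", "they are crispy", "its soft"),
  stringsAsFactors = FALSE
)

# Print an R expression that recreates the object; this text can be
# pasted straight into a question
dput(df_phrases)

# Round trip: evaluating the printed text rebuilds the data frame
rebuilt <- eval(parse(text = capture.output(dput(df_phrases))))
```

Anyone answering can then copy the printed expression and get the same data frame without retyping it.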

A quick peek at the data frames to make sure they're correct:

# Examine data frames
df_object
#>           Description      Object
#> 1 sweet and delicious       apple
#> 2                hard        nuts
#> 3       wet and fluid       water
#> 4         sugar fluid        coke
#> 5                soft marshmallow
#> 6        crispy salty       chips

df_phrases
#>           Phrases
#> 1     sugar fluid
#> 2 they are crispy
#> 3        its soft

Next is the meat of the solution.

  • I create a function qd that takes a phrase and compares it to the Description in df_object to find the most similar.
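  • adist computes the generalized Levenshtein (edit) distance between the phrase and each Description; a smaller distance means a closer match. With partial = TRUE, the phrase is matched against the closest substring of each Description.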
  • which.min finds the smallest value returned by adist (i.e., the most similar).
  • The value returned by which.min is used to look up the corresponding Object.

# Quick & dirty function
qd <- function(phrase) {
  with(df_object, Object[which.min(adist(phrase, Description, partial = TRUE))])
}
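To see what the function works with internally, it can help to look at the distances for a single phrase. This is only an illustrative sketch: the descriptions vector is copied from df_object, and the names d and best are made up for the example.

```r
# Descriptions copied from df_object above
descriptions <- c("sweet and delicious", "hard", "wet and fluid",
                  "sugar fluid", "soft", "crispy salty")

# adist returns a matrix: one row for the phrase, one column per description
d <- adist("its soft", descriptions, partial = TRUE)
colnames(d) <- descriptions
print(d)

# Position of the smallest distance, i.e. the closest description
best <- which.min(d)
descriptions[best]
```

Note that which.min returns the first minimum, so ties are resolved silently in favour of the earlier row of df_object.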

I then apply this to all Phrases and store the result in Obj_Assigned:

# Apply 'qd' to 'Phrases' and store as 'Obj_Assigned'
df_phrases$Obj_Assigned <- sapply(df_phrases$Phrases, qd)

# Examine results
df_phrases
#>           Phrases Obj_Assigned
#> 1     sugar fluid         coke
#> 2 they are crispy        chips
#> 3        its soft  marshmallow

Created on 2019-12-03 by the reprex package (v0.2.1.9000)

The result is as requested, but calling this approach flimsy would be generous. It's easy to break and not especially reliable, though it works for your toy example.
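Since the question asks for the object with the most matched words, here is a rough sketch of that idea as an alternative to edit distance. It is not part of the solution above: match_by_words is a hypothetical helper, words are split on whitespace and compared exactly after lower-casing, ties go to the first description, and a phrase with no shared words gets NA.

```r
# Same data as above, as plain vectors
descriptions <- c("sweet and delicious", "hard", "wet and fluid",
                  "sugar fluid", "soft", "crispy salty")
objects <- c("apple", "nuts", "water", "coke", "marshmallow", "chips")
phrases <- c("sugar fluid", "they are crispy", "its soft")

# Hypothetical helper: pick the object whose description shares the
# most whole words with the phrase
match_by_words <- function(phrase, descriptions, objects) {
  phrase_words <- strsplit(tolower(phrase), "\\s+")[[1]]
  overlap <- vapply(
    strsplit(tolower(descriptions), "\\s+"),
    function(desc_words) length(intersect(phrase_words, desc_words)),
    integer(1)
  )
  # No shared words at all: report a missing value rather than guess
  if (max(overlap) == 0L) return(NA_character_)
  objects[which.max(overlap)]
}

assigned <- vapply(phrases, match_by_words, character(1),
                   descriptions = descriptions, objects = objects,
                   USE.NAMES = FALSE)
data.frame(Phrases = phrases, Obj_Assigned = assigned)
```

Because it compares whole words, this sketch will not match variants such as "crisp" and "crispy"; handling those would need stemming or a fuzzy comparison like adist.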

Upvotes: 1
