R character match and rank

Question

I have a character vector

var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
                 "clean water", "pine")

and a list

var2 <- list(c("tall tree", "fruits", "star"),  c("tree tall", "pine tree",
  "tree pine", "black forest", "water"), c("apple", "orange", "grapes"))

I want to match words in var1 with elements in var2, and obtain the ranked elements of var2. For example, the desired output here is:

"tree tall"    "pine tree"    "tree pine"    "black forest" "water"

var2[2] is rank 1 (4 phrases in var1: pine tree, dense forest, pine, and clean water match var2[2])

"tall tree" "fruits"    "star"

var2[1] is rank 2 (3 phrases in var1: pine tree, red fruits, and green fruits match var2[1])

 "apple"  "orange" "grapes"

var2[3] is rank 3, since it has no match with var1

I tried

indx1 <- sapply(var2, function(x) sum(grepl(var1, x)))

without getting the output desired.

How can I solve this? A code snippet would be appreciated. Thanks.

EDIT:

The new data is given below:

var11 <- c("nature", "environmental", "ringing", "valley", "status", "climate",
           "forge", "environmental", "common", "birdwatch", "big", "link",
           "day", "pintail", "morning", "big garden", "birdwatch deadline",
           "deadline february", "mu condition", "garden birdwatch", "status",
           "chorus walk", "dawn choru", "walk sunday", "climate lobby",
           "lobby parliament", "u status", "sandwell valley", "my status of",
           "environmental lake")


var22 <- list(c("environmental condition"),  c("condition", "status"), c("water", "ocean water"))

akrun · Accepted Answer

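We can loop over 'var2' with sapply(var2, ...), split each string at white space (strsplit(x, ' ')), and paste the resulting words together with '|' to form a grepl pattern that is tested against 'var1'. For each phrase we check whether there is any match, sum the resulting logical vector to get one count per list element, and rank the counts. The ranks can then be used for reordering the 'var2' elements.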
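To make these steps concrete, here is a minimal sketch that runs them one at a time on a single element of 'var2'. The intermediate names (x, words, hits) are only for illustration and are not part of the answer's code.

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
x <- c("tall tree", "fruits", "star")   # first element of var2

# 1. Split each phrase into its words
words <- strsplit(x, " ")
# words[[1]] is c("tall", "tree")

# 2. Build one alternation pattern per phrase and test it against var1
hits <- sapply(words, function(y) any(grepl(paste(y, collapse = "|"), var1)))
hits
# [1]  TRUE  TRUE FALSE

# 3. Count how many phrases of this element found a match
sum(hits)
# [1] 2
```

The one-liner below nests these same steps inside sapply() over all of 'var2' and passes the negated counts to rank().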

indx <- rank(-sapply(var2, function(x) sum(sapply(strsplit(x, ' '),
             function(y) any(grepl(paste(y, collapse='|'), var1))))),
             ties.method='first')
indx
#[1] 2 1 3


var2[indx]
#[[1]]
#[1] "tree tall"    "pine tree"    "tree pine"    "black forest" "water"       

#[[2]]
#[1] "tall tree" "fruits"    "star"     

#[[3]]
#[1] "apple"  "orange" "grapes"
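One caveat when reusing this on other data: rank() and order() happen to return the same vector (2 1 3) here, so indexing with the ranks works. In general they differ, and order() is the one that gives the indices needed for reordering. A small sketch with made-up counts (not taken from the question) shows the difference:

```r
# Hypothetical match counts for three list elements (illustration only)
counts <- c(0, 5, 2)
items  <- list("a", "b", "c")

rank(-counts, ties.method = "first")
# [1] 3 1 2   the position each element should end up in
order(-counts)
# [1] 2 3 1   the index of the element that belongs in each position

# Indexing with order() puts the element with the highest count first
sorted <- items[order(-counts)]
unlist(sorted)
# [1] "b" "c" "a"
```

Indexing with the rank vector instead would give "c" "a" "b" for these counts, so order(-counts) is the safer choice when the goal is to reorder the list.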

Update

If every matching element of 'var11' should be counted, duplicates included, try

indx <- rank(-sapply(var22, function(x) sum(sapply(strsplit(x, ' '), 
        function(y) sum(sapply(strsplit(var11, ' '), 
          function(z) any(grepl(paste(y, collapse="|"), z))))))),
             ties.method='random')
indx
#[1] 2 1 3
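To check a ranking like this, it can help to look at the raw counts before rank() is applied. The sketch below repeats the data from the question's edit and pulls the counting step out on its own; the names words11, pat, and counts are introduced here for illustration.

```r
var11 <- c("nature", "environmental", "ringing", "valley", "status", "climate",
           "forge", "environmental", "common", "birdwatch", "big", "link",
           "day", "pintail", "morning", "big garden", "birdwatch deadline",
           "deadline february", "mu condition", "garden birdwatch", "status",
           "chorus walk", "dawn choru", "walk sunday", "climate lobby",
           "lobby parliament", "u status", "sandwell valley", "my status of",
           "environmental lake")
var22 <- list(c("environmental condition"), c("condition", "status"),
              c("water", "ocean water"))

# Split var11 once instead of inside every loop iteration
words11 <- strsplit(var11, " ")

# For each phrase, count the var11 entries that contain any of its words,
# then add the per-phrase counts up for each list element
counts <- sapply(var22, function(x) sum(sapply(strsplit(x, " "), function(y) {
  pat <- paste(y, collapse = "|")
  sum(sapply(words11, function(z) any(grepl(pat, z))))
})))
counts
# [1] 4 5 0
```

With these counts, the second element of 'var22' has the most matches and the third has none.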

Update2

If we need to filter out the elements of 'var2' that have no match in 'var1' (here 'indx' is the rank vector computed for 'var2' above):

pat <- paste(unique(unlist(strsplit(var1, ' '))), collapse="|")
Filter(function(x) any(grepl(pat, x)), var2[indx])
#[[1]]
#[1] "tree tall"    "pine tree"    "tree pine"    "black forest" "water"       

#[[2]]
#[1] "tall tree" "fruits"    "star"
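Note that a plain alternation pattern matches substrings, so a short word such as "red" also matches inside longer words. If whole-word matching is wanted, one option is to wrap the pattern in word boundaries. This is a sketch of that variation, not part of the approach above; the vector x is a made-up example.

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
words <- unique(unlist(strsplit(var1, " ")))

# Hypothetical element that contains "red" only as part of other words
x <- c("bored", "starred")

plain   <- paste(words, collapse = "|")
bounded <- paste0("\\b(", plain, ")\\b")

any(grepl(plain, x))
# [1] TRUE    "red" matches inside "bored"
any(grepl(bounded, x))
# [1] FALSE   no whole word from var1 is present
```

Whether substring matches should count depends on the data, so the bounded pattern is only worth using when partial hits are unwanted.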
