Reputation: 635
I have a character vector
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
"clean water", "pine")
and a list
var2 <- list(c("tall tree", "fruits", "star"), c("tree tall", "pine tree",
"tree pine", "black forest", "water"), c("apple", "orange", "grapes"))
I want to match words in var1 with elements in var2, and obtain the ranked elements of var2. For example, the desired output here is:
"tree tall" "pine tree" "tree pine" "black forest" "water"
var2[2] is rank 1 (4 phrases in var1 match var2[2]: pine tree, dense forest, pine, and clean water)
"tall tree" "fruits" "star"
var2[1] is rank 2 (3 phrases in var1 match var2[1]: pine tree, red fruits, and green fruits)
"apple" "orange" "grapes"
var2[3] is rank 3, since it has no match in var1
I tried
indx1 <- sapply(var2, function(x) sum(grepl(var1, x)))
but it does not give the desired output.
How can I solve this? A code snippet would be appreciated. Thanks.
EDIT:
The new data is given below:
var11 <- c("nature" , "environmental", "ringing", "valley" , "status" , "climate" ,
"forge" , "environmental" , "common" ,
"birdwatch", "big" , "link" ,
"day" , "pintail" , "morning" ,
"big garden" , "birdwatch deadline", "deadline february" ,
"mu condition" , "garden birdwatch" , "status" ,
"chorus walk" , "dawn choru" , "walk sunday",
"climate lobby" , "lobby parliament" , "u status" ,
"sandwell valley" , "my status of" , "environmental lake")
var22 <- list(c("environmental condition"), c("condition", "status"), c("water", "ocean water"))
Upvotes: 1
Views: 269
Reputation: 887128
We can loop over 'var2' (sapply(var2, ...)), split the strings at white space (strsplit(x, ' ')), and grep the output list elements as patterns for 'var1'. Check if there is any match, sum the logical vector, and rank it. This can be used for reordering the 'var2' elements.
indx <- rank(-sapply(var2, function(x) sum(sapply(strsplit(x, ' '),
          function(y) any(grepl(paste(y, collapse='|'), var1))))),
          ties.method='first')
indx
#[1] 2 1 3
var2[order(indx)]  # order() maps ranks back to positions; same as var2[indx] here
#[[1]]
#[1] "tree tall" "pine tree" "tree pine" "black forest" "water"
#[[2]]
#[1] "tall tree" "fruits" "star"
#[[3]]
#[1] "apple" "orange" "grapes"
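The one-liner above can be unpacked into named steps. This is only a sketch of the same idea, assuming var1 and var2 as given in the question; the helper name count_matches is hypothetical. It indexes with order() on the counts, which gives the same result as the ranks in this example. Note that grepl() matches substrings, so a word like "pine" would also match "pineapple".

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

# Hypothetical helper: number of phrases in `phrases` that share at least
# one word (used as a regex alternation) with any element of `targets`.
count_matches <- function(phrases, targets) {
  words <- strsplit(phrases, " ", fixed = TRUE)
  hits <- vapply(words,
                 function(w) any(grepl(paste(w, collapse = "|"), targets)),
                 logical(1))
  sum(hits)
}

# One count per element of var2
counts <- vapply(var2, count_matches, integer(1), targets = var1)
counts
# [1] 2 5 0

# Reorder var2 from most to fewest matches
ranked <- var2[order(-counts)]
```

Keeping the raw counts around makes it easy to inspect ties before deciding on a ties.method.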
If we need to count the duplicates as well, try
indx <- rank(-sapply(var22, function(x) sum(sapply(strsplit(x, ' '),
          function(y) sum(sapply(strsplit(var11, ' '),
            function(z) any(grepl(paste(y, collapse="|"), z))))))),
          ties.method='random')
indx
#[1] 1 2
If we need to filter out the elements in 'var2' that don't have any match with 'var1':
pat <- paste(unique(unlist(strsplit(var1, ' '))), collapse="|")
Filter(function(x) any(grepl(pat, x)), var2[order(indx)])
#[[1]]
#[1] "tree tall" "pine tree" "tree pine" "black forest" "water"
#[[2]]
#[1] "tall tree" "fruits" "star"
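As a variant, the combined pattern can drive both the counting and the filtering in one pass. This is a sketch under the assumption that counting the elements of each group that contain any word from var1 is an acceptable score; the names hits and matched are made up for illustration.

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

# All distinct words of var1 joined into one alternation pattern
pat <- paste(unique(unlist(strsplit(var1, " ", fixed = TRUE))), collapse = "|")

# Number of elements in each group that contain any var1 word
hits <- vapply(var2, function(x) sum(grepl(pat, x)), integer(1))
hits
# [1] 2 5 0

# Drop groups without a match, then sort the rest by descending count
keep <- hits > 0
matched <- var2[keep][order(-hits[keep])]
```

Here matched has two elements, with the "tree tall" group first.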
Upvotes: 1
Reputation: 710
The following code could work:
idx <- rank(-sapply(var2,
         function(x) sum(unlist(sapply(strsplit(var1, split=' '),
           function(y) any(unlist(sapply(y,
             function(z) grepl(z, x))>0))>0)))),
         ties.method='random')
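For readability, the same counting can be written with a small helper. This is a sketch, not a drop-in replacement: the name count_phrases is hypothetical, and it uses fixed (non-regex) substring matching, which is an assumption on my part. It counts the phrases of var1 that have at least one word occurring in a group, which reproduces the 4 and 3 given in the question.

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

# Hypothetical helper: number of phrases in `phrases` with at least one word
# that occurs as a substring of some element of `group`.
count_phrases <- function(group, phrases) {
  words <- strsplit(phrases, " ", fixed = TRUE)
  hits <- vapply(words, function(w) {
    any(vapply(w, function(z) any(grepl(z, group, fixed = TRUE)), logical(1)))
  }, logical(1))
  sum(hits)
}

counts <- vapply(var2, count_phrases, integer(1), phrases = var1)
counts
# [1] 4 3 0

# order() gives positions directly, so no tie-breaking by chance is involved
sorted <- var2[order(-counts)]
```

Using order() here avoids the non-reproducible tie-breaking of ties.method='random'; ties keep their original relative order.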
Upvotes: 0