Nix

Reputation: 149

Filter dataframe using a second dataframe value by value

I have always had trouble with loops, so I am asking here. I have two dataframes: one very large and one much smaller. Sample versions are below.

Dataframe 1

ID        Value
1         apples
1         apples
1         bananas
1         grapes
1         mangoes
1         oranges
1         grapes
1         apples
1         grapes
2         apples
2         apples
2         passionfruits
2         bananas
2         apples
2         apples
2         passionfruits
2         grapes
2         mangoes
2         apples
3         apples
3         bananas
3         oranges
3         apples
3         grapes
3         grapes
3         passionfruits
3         passionfruits
3         oranges
4         apples
4         oranges
4         mangoes
4         bananas
4         grapes
4         grapes
4         grapes
4         apples
4         oranges
4         grapes
4         mangoes
4         mangoes
4         apples
4         oranges
5         passionfruits
5         apples
5         oranges
5         oranges
5         mangoes
5         grapes
5         apples
5         bananas

Dataframe 2

Value
apples
apples
bananas
grapes
mangoes
mangoes
grapes
apples
apples

The different IDs in dataframe 1 are considered as sets. Dataframe 2 in its entirety will be an approximate or exact match to one of those sets. I know there is plenty of code to filter dataframe 1 using the whole of dataframe 2, but that is not what I require. I need it to filter sequentially, value by value, with a condition attached: whether the previous value matches.

So in this example, nothing happens with the first value because all IDs contain 'apples'. The second value = 'apples', given that the previous value = 'apples', filters out ID = 4 because it doesn't contain 'apples' occurring twice in a row. We then search the filtered dataframe 1 for the third value, and so on. It stops only when one ID set remains in dataframe 1, so in this case after the 3rd iteration. The result should be

Dataframe 1

ID        Value
1         apples
1         apples
1         bananas
1         grapes
1         mangoes
1         oranges
1         grapes
1         apples
1         grapes
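The step-by-step filtering described above can be sketched in base R. This is a minimal sketch under one assumed interpretation of the rule: at step k, keep only the IDs whose Value sequence contains df2$Value[1:k] as a consecutive run, and stop as soon as a single ID remains (the helper `contains_run` is invented for this sketch):

```r
# Sample data from the question
df1 <- data.frame(
  ID = rep(1:5, c(9, 10, 9, 14, 8)),
  Value = c("apples", "apples", "bananas", "grapes", "mangoes", "oranges",
            "grapes", "apples", "grapes",
            "apples", "apples", "passionfruits", "bananas", "apples",
            "apples", "passionfruits", "grapes", "mangoes", "apples",
            "apples", "bananas", "oranges", "apples", "grapes", "grapes",
            "passionfruits", "passionfruits", "oranges",
            "apples", "oranges", "mangoes", "bananas", "grapes", "grapes",
            "grapes", "apples", "oranges", "grapes", "mangoes", "mangoes",
            "apples", "oranges",
            "passionfruits", "apples", "oranges", "oranges", "mangoes",
            "grapes", "apples", "bananas"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Value = c("apples", "apples", "bananas", "grapes", "mangoes", "mangoes",
            "grapes", "apples", "apples"),
  stringsAsFactors = FALSE
)

# TRUE if 'pattern' occurs as a consecutive run inside 'vals'
contains_run <- function(vals, pattern) {
  k <- length(pattern)
  if (length(vals) < k) return(FALSE)
  any(vapply(seq_len(length(vals) - k + 1),
             function(i) all(vals[i:(i + k - 1)] == pattern),
             logical(1)))
}

# Grow the prefix of df2 one value at a time, dropping IDs that no longer match
ids <- unique(df1$ID)
for (k in seq_along(df2$Value)) {
  keep <- vapply(ids, function(id)
    contains_run(df1$Value[df1$ID == id], df2$Value[1:k]),
    logical(1))
  ids <- ids[keep]
  if (length(ids) == 1) break
}
df1[df1$ID %in% ids, ]  # rows for ID 1, reached after the third step
```

With this data the loop stops at k = 3 ('apples', 'apples', 'bananas'), leaving only ID 1, matching the expected result above.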

Upvotes: 0

Views: 86

Answers (3)

CPak

Reputation: 13581

I suggest turning the Values in each group into a string and comparing their string edit distance with adist, which computes the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.

string_edit_dist <- function(vec1, vec2) {
    c(adist(paste0(vec1, collapse = ""), paste0(vec2, collapse = "")))
}

ids <- unique(df1$ID)
best <- ids[which.min(sapply(ids, function(i) string_edit_dist(df1$Value[df1$ID == i], df2$Value)))]
df1[df1$ID == best, ]

#   ID   Value
# 1  1  apples
# 2  1  apples
# 3  1 bananas
# 4  1  grapes
# 5  1 mangoes
# 6  1 oranges
# 7  1  grapes
# 8  1  apples
# 9  1  grapes

Here is the string edit distance for each group:

sapply(unique(df1$ID), function(i) string_edit_dist(df1$Value[df1$ID == i], df2$Value))
# 7 35 45 46 27

Upvotes: 0

MKR

Reputation: 20085

We can merge Value for each ID using a token separator (say #) and then write a custom function that counts how many tokens match position by position. Finally, select the data for the ID with the maximum match count.

library(dplyr)

# This function counts how many tokens (separated by `#`) match position by position
# matched_count("a#b#c", "a#e#c#d") returns 2 (positions 1 and 3 match)
matched_count <- function(x, y){
  x_v <- strsplit(x, split = "#")[[1]]
  y_v <- strsplit(y, split = "#")[[1]]
  max_len <- max(length(x_v), length(y_v))
  length(x_v) <- max_len   # pad the shorter vector with NA
  length(y_v) <- max_len
  sum(x_v == y_v, na.rm = TRUE)
}


Dataframe1 %>% group_by(ID) %>%
  mutate(CompStr = paste0(Value, collapse="#")) %>% #collapse values for ID
  mutate(CompStrdf2 = paste0(Dataframe2$Value, collapse="#")) %>% 
  mutate(max_match = matched_count(CompStr, CompStrdf2)) %>%
  ungroup() %>%
  filter(max_match == max(max_match)) %>%
  select(ID, Value)

# ID Value  
# <int> <chr>  
# 1     1 apples 
# 2     1 apples 
# 3     1 bananas
# 4     1 grapes 
# 5     1 mangoes
# 6     1 oranges
# 7     1 grapes 
# 8     1 apples 
# 9     1 grapes 

Upvotes: 0

Jaap

Reputation: 83215

A possible approach with data.table (an adaptation from my answer here):

# load packages
library(data.table)

# create a function which calculates match-score with 'df2$Value'
maxscore <- function(x, y) {
  m <- mapply('==', shift(x, type = 'lead', n = 0:(length(y) - 1)), y)
  max(rowSums(m, na.rm = TRUE))
}

# calculate the match-score for each group
# and filter out the other groups
setDT(df1)[, score := maxscore(Value, df2$Value), by = ID
           ][score == max(score)][, score := NULL][]

which gives:

   ID   Value
1:  1  apples
2:  1  apples
3:  1 bananas
4:  1  grapes
5:  1 mangoes
6:  1 oranges
7:  1  grapes
8:  1  apples
9:  1  grapes

You can use that function in a dplyr-chain as well (but you will still need the data.table-package for the shift-function):

library(dplyr)
df1 %>% 
  group_by(ID) %>% 
  mutate(m = maxscore(Value, df2$Value)) %>% 
  ungroup() %>% 
  filter(m == max(m)) %>% 
  select(-m)

An alternative implementation of the maxscore-function (inspired by @doscendo's answer here):

maxscore2 <- function(x, y) {
  w <- which(x == y[1])           # candidate alignment starts
  if (length(w) == 0) return(0)   # guard: y[1] never occurs in x
  v <- sapply(w, function(i) sum(x[i:(i + (length(y) - 1))] == y, na.rm = TRUE))
  max(v)
}
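A quick sanity check of maxscore2 on toy data (the function is repeated here, with a small guard for the case where y[1] never occurs in x, so the snippet runs on its own; the vectors are invented for illustration):

```r
# maxscore2: best alignment score of y against x, starting at each occurrence of y[1]
maxscore2 <- function(x, y) {
  w <- which(x == y[1])           # candidate alignment starts
  if (length(w) == 0) return(0)   # guard: y[1] never occurs in x
  v <- sapply(w, function(i) sum(x[i:(i + (length(y) - 1))] == y, na.rm = TRUE))
  max(v)
}

x <- c("apples", "apples", "bananas", "grapes")
y <- c("apples", "bananas")
maxscore2(x, y)  # 2: aligning y at position 2 matches both values
```

Out-of-range indexing (when an alignment runs past the end of x) yields NA, which na.rm = TRUE in the sum safely ignores.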

Upvotes: 2
