Reputation: 149
I have always had trouble with loops, so I am asking here. I have two dataframes: one very large and one much smaller. Sample versions are below.
Dataframe 1
ID Value
1 apples
1 apples
1 bananas
1 grapes
1 mangoes
1 oranges
1 grapes
1 apples
1 grapes
2 apples
2 apples
2 passionfruits
2 bananas
2 apples
2 apples
2 passionfruits
2 grapes
2 mangoes
2 apples
3 apples
3 bananas
3 oranges
3 apples
3 grapes
3 grapes
3 passionfruits
3 passionfruits
3 oranges
4 apples
4 oranges
4 mangoes
4 bananas
4 grapes
4 grapes
4 grapes
4 apples
4 oranges
4 grapes
4 mangoes
4 mangoes
4 apples
4 oranges
5 passionfruits
5 apples
5 oranges
5 oranges
5 mangoes
5 grapes
5 apples
5 bananas
Dataframe 2
Value
apples
apples
bananas
grapes
mangoes
mangoes
grapes
apples
apples
The different IDs in dataframe 1 are considered sets. Dataframe 2, in its entirety, will be an approximate or exact match to one of those sets. I know there is plenty of code to filter dataframe 1 using the whole of dataframe 2 at once, but that is not what I require. I require it to filter sequentially, value by value, with a condition attached: whether the previous value matches.
So in this example, nothing happens with the first value because all IDs contain 'apples'. The second value = 'apples', given that the previous value = 'apples', filters out ID = 4 because it doesn't contain 'apples' occurring twice in a row. Then, in the filtered dataframe 1, we search for the third value, and so on. It stops only when one ID set remains in dataframe 1, which here happens after the 3rd iteration. The result should be
Dataframe 1
ID Value
1 apples
1 apples
1 bananas
1 grapes
1 mangoes
1 oranges
1 grapes
1 apples
1 grapes
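To make the condition concrete, here is a rough sketch of the pair-by-pair elimination I have in mind (assuming the two frames above are loaded as df1 and df2; this is just to illustrate the logic, not working code I have):
# rough sketch of the sequential elimination
remaining <- unique(df1$ID)
# first value: keep IDs that contain it at all
remaining <- Filter(function(id) df2$Value[1] %in% df1$Value[df1$ID == id], remaining)
k <- 2
while (length(remaining) > 1 && k <= nrow(df2)) {
  # keep only IDs whose sequence contains df2$Value[k-1]
  # immediately followed by df2$Value[k]
  remaining <- Filter(function(id) {
    v <- df1$Value[df1$ID == id]
    any(v[-length(v)] == df2$Value[k - 1] & v[-1] == df2$Value[k])
  }, remaining)
  k <- k + 1
}
df1[df1$ID %in% remaining, ]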
Upvotes: 0
Views: 86
Reputation: 13581
I suggest turning the Values in each group into a string and comparing their string edit distance.
adist - Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.
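For instance, a quick illustration of adist on a standard example (not from the data here):
adist("kitten", "sitting")  # 3: two substitutions and one insertion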
# collapse each value vector into a single string and return the
# Levenshtein distance between the two strings
string_edit_dist <- function(vec1, vec2) {
  c(adist(paste0(vec1, collapse=""), paste0(vec2, collapse="")))
}
ids <- unique(df1$ID)
ind <- ids[which.min(sapply(ids, function(i) string_edit_dist(df1$Value[df1$ID == i], df2$Value)))]
df1[df1$ID == ind, ]
# ID Value
# 1 1 apples
# 2 1 apples
# 3 1 bananas
# 4 1 grapes
# 5 1 mangoes
# 6 1 oranges
# 7 1 grapes
# 8 1 apples
# 9 1 grapes
Here is the string edit distance for each group:
sapply(ids, function(i) string_edit_dist(df1$Value[df1$ID == i], df2$Value))
# 7 35 45 46 27
Upvotes: 0
Reputation: 20085
We can merge Value for each ID using a token separator (say #) and then write a custom function that compares how many sequential tokens match. Finally, we select the data for the ID that has the maximum match.
library(dplyr)
# This function counts position-wise matching tokens separated by `#`
# e.g. matched_count("a#b#c", "a#e#c#d") returns 2 ("a" and "c" match in place)
matched_count <- function(x, y){
  x_v <- strsplit(x, split = "#")[[1]]
  y_v <- strsplit(y, split = "#")[[1]]
  max_len <- max(length(x_v), length(y_v))
  length(x_v) <- max_len  # pad the shorter vector with NA
  length(y_v) <- max_len
  sum(x_v == y_v, na.rm = TRUE)
}
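For example, a quick check of the helper on tokens taken from the sample data:
matched_count("apples#apples#bananas", "apples#apples#bananas#grapes")  # 3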
Dataframe1 %>% group_by(ID) %>%
mutate(CompStr = paste0(Value, collapse="#")) %>% #collapse values for ID
mutate(CompStrdf2 = paste0(Dataframe2$Value, collapse="#")) %>%
mutate(max_match = matched_count(CompStr, CompStrdf2)) %>%
ungroup() %>%
filter(max_match == max(max_match)) %>%
select(ID, Value)
# ID Value
# <int> <chr>
# 1 1 apples
# 2 1 apples
# 3 1 bananas
# 4 1 grapes
# 5 1 mangoes
# 6 1 oranges
# 7 1 grapes
# 8 1 apples
# 9 1 grapes
Upvotes: 0
Reputation: 83215
A possible approach with data.table (an adaptation from my answer here):
# load packages
library(data.table)
# create a function which calculates match-score with 'df2$Value'
maxscore <- function(x, y) {
  # lay 'y' over 'x' at every possible offset; entry (j, i) of 'm' says
  # whether y[i] matches x at offset j
  m <- mapply('==', shift(x, type = 'lead', n = 0:(length(y) - 1)), y)
  # the best offset is the row with the most matches
  max(rowSums(m, na.rm = TRUE))
}
# calculate the match-score for each group
# and filter out the other groups
setDT(df1)[, score := maxscore(Value, df2$Value), by = ID
][score == max(score)][, score := NULL][]
which gives:
   ID   Value
1:  1  apples
2:  1  apples
3:  1 bananas
4:  1  grapes
5:  1 mangoes
6:  1 oranges
7:  1  grapes
8:  1  apples
9:  1  grapes
You can use that function in a dplyr-chain as well (but you will still need the data.table package for the shift function):
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(m = maxscore(Value, df2$Value)) %>%
ungroup() %>%
filter(m == max(m)) %>%
select(-m)
An alternative implementation of the maxscore function (inspired by @doscendo's answer here):
maxscore2 <- function(x, y) {
  # candidate starting positions: where 'x' matches the first element of 'y'
  w <- which(x == y[1])
  if (length(w) == 0) return(0)  # no possible alignment
  # for each candidate start, count how many elements of the window match 'y'
  v <- sapply(w, function(i) sum(x[i:(i+(length(y)-1))] == y, na.rm = TRUE))
  max(v)
}
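Both implementations give the same score on the sample data; for example:
maxscore2(df1$Value[df1$ID == 1], df2$Value)  # 7, same as maxscore()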
Upvotes: 2