Reputation: 149
I have this data frame containing rows of strings and IDs. I call it "history".
history
ID string
1.1 a b b b c c s d s ....
1.2 a b b b b c s s d ....
2.1 a c c s s d b d b ....
2.2 a c s c s d b d b ....
3.1 a z z x d b d d f ....
3.2 a z x z d d f b d ....
...
The strings in each row are quite long. IDs that share the same integer part, like 1.1 and 1.2, have similar strings with only minor differences; 1.1 and 2.2, on the other hand, will differ more substantially. There are around 70 rows in the original data.
test
string
a c c c s s d b d b....
My goal: if I have another data frame containing a string that isn't present in "history", I would like to find which ID in "history" it matches best. I know there are plenty of text-matching methods that can do this. My issue comes from the fact that I cannot match the whole string in "test" against "history".
The whole point of this is to see if I can figure out which ID the string in "test" belongs to without having to match the whole string. One idea I've thought of is to filter "history" down as more and more of "test" is matched.
My expected output: here I am assuming the match starts with the first character of the string in "test" against the first characters of the strings in "history", and that we go character by character. Neither assumption is fixed. Also, the lengths of the strings in "history" and "test" can differ.
The first character "a" from "test" matches every row in "history", so no filtering happens in this case.
test
string
a
result:
history
ID string
1.1 a b b b c c s d s ....
1.2 a b b b b c s s d ....
2.1 a c c s s d b d b ....
2.2 a c s c s d b d b ....
3.1 a z z x d b d d f ....
3.2 a z x z d d f b d ....
...
The second character is "c". Here, to make sure we are not matching a random "c" from somewhere else in "history", I think establishing a rule will help: a match only counts if "a" is followed by "c".
test
string
a c
result:
history
ID string
2.1 a c c s s d b d b ....
2.2 a c s c s d b d b ....
This has already narrowed the match down to history IDs 2.1 and 2.2. Frankly, we could even stop here since, as I said before, the differences between those two are minute. So, in conclusion, as soon as "history" has been filtered down to just one ID, the output should be the ID that best matches the "test" string.
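To make the idea concrete, here is a rough sketch of the step-by-step filtering I have in mind (this is only an illustration, using made-up data matching the examples above, not a working solution I already have):
library(tidyverse)

# toy version of "history" and "test", matching the examples above
history <- data.frame(ID = c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2),
                      string = c("a b b b c c s d s",
                                 "a b b b b c s s d",
                                 "a c c s s d b d b",
                                 "a c s c s d b d b",
                                 "a z z x d b d d f",
                                 "a z x z d d f b d"),
                      stringsAsFactors = FALSE)
test <- "a c c c s s d b d b"

test_tokens <- strsplit(test, " ")[[1]]
candidates  <- history

# grow the matched prefix one token at a time and drop rows whose prefix differs
for (i in seq_along(test_tokens)) {
  prefix     <- paste(test_tokens[1:i], collapse = " ")
  candidates <- filter(candidates, startsWith(string, prefix))
  if (nrow(candidates) <= 1) break   # stop as soon as (at most) one ID is left
}
candidates$ID
# [1] 2.1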
Upvotes: 1
Views: 715
Reputation: 36
Building on top of the excellent example given by AntoniosK in the other answer:
You could apply a weighting factor to each column. So if column 1 is very important, multiply it by 10,000 and the second column by only 1,000. Then sum the values row-wise and take the highest sum to get the best-fitting strings.
(I.e. "a b c d e f" matches "a b c x x x" better than it matches "a b x d e f".)
library(tidyverse)

df = data.frame(ID = c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2),
                string = c("a b b b c c s d s",
                           "a b b b b c s s d",
                           "a c c s s d b d b",
                           "a c s c s d b d b",
                           "a z z x d b d d f",
                           "a z x z d d f b d"),
                stringsAsFactors = FALSE)
# string to test
test <- "a c c c s s"
# one weight per token position: the first positions count much more than the later ones
weights <- c(1000, 100, 10, 10, 10, 10, 10, 10, 10)
df_answer <- df %>%
  separate_rows(string) %>%       # one row per token, keeping the ID
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%  # align the test tokens by position
  mutate(scores = (string == test) * weights) %>%                       # weighted match per position (0 when no match)
  summarise(scores = sum(scores, na.rm = TRUE)) %>%                     # total weighted score per ID
  filter(scores == max(scores))                                         # keep the best-scoring ID(s)
# A tibble: 2 x 2
# ID scores
# <dbl> <dbl>
#1 2.1 1120
#2 2.2 1120
Upvotes: 0
Reputation: 16121
Here are two tidyverse solutions that will return the ID value(s) with the maximum number of matches against your test string, together with the number of matches:
df = data.frame(ID = c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2),
                string = c("a b b b c c s d s",
                           "a b b b b c s s d",
                           "a c c s s d b d b",
                           "a c s c s d b d b",
                           "a z z x d b d d f",
                           "a z x z d d f b d"),
                stringsAsFactors = FALSE)
library(tidyverse)
# string to test
test = "a c c c s s"
Option 1 (counting position-wise matches, not necessarily consecutive)
df %>%
  separate_rows(string) %>%       # one row per token, keeping the ID
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%  # align the test tokens by position
  na.omit() %>%                   # drop positions beyond the length of the test string
  summarise(matches = sum(string == test)) %>%   # count position-wise matches
  filter(matches == max(matches))                # keep the ID(s) with the most matches
# # A tibble: 2 x 2
# ID matches
# <dbl> <int>
# 1 2.1 4
# 2 2.2 4
Option 2 (counting only consecutive matches from the first position)
df %>%
  separate_rows(string) %>%
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%
  na.omit() %>%
  summarise(matches = sum(cumprod(string == test))) %>%  # cumprod drops to 0 at the first mismatch
  filter(matches == max(matches))
# # A tibble: 1 x 2
# ID matches
# <dbl> <dbl>
# 1 2.1 3
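If a single ID is wanted rather than a tibble (as suggested at the end of the question), one possible extension of Option 2 would be the following sketch, which breaks any ties arbitrarily:
df %>%
  separate_rows(string) %>%
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%
  na.omit() %>%
  summarise(matches = sum(cumprod(string == test))) %>%
  filter(matches == max(matches)) %>%
  slice(1) %>%    # break possible ties by taking the first row
  pull(ID)
# [1] 2.1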
Upvotes: 1