Nix
Nix

Reputation: 149

String matching character by character in R

I have this dataframe containing rows of strings and IDs. I call this history.

history
ID   string
1.1  a b b b c c s d s ....
1.2  a b b b b c s s d ....
2.1  a c c s s d b d b ....
2.2  a c s c s d b d b ....
3.1  a z z x d b d d f ....
3.2  a z x z d d f b d ....
...

The strings in each row are quite long. IDs belonging to the same number like 1.1 and 1.2 have similar strings with minor differences. 1.1 and 2.2 though will have larger differences between them. There are around 70 rows in the original data.

test
string
a c c c s s d b d b....

My goal is that if i have another dataframe containing a string that isnt present in "history" I would like to find what ID in "history" it matches best to. I know there are plenty of text matching methods that can do this. My issue comes from the fact that I cannot match the whole string in "test" against "history".

The whole point of this is to see if i can figure out which ID the string in "test" belongs to without having to match the whole string. One idea i've thought of is to filter out history as we make more matches in test.

My expected output: Here I am assuming the match starts with the first character of the string in "test "against first characters of the strings in "history". And we go character by character. Both these assumptions are not fixed. Also the lengths of the strings in "history" and "test" can be different.

The first character "a" from "test" is a match to all in "history". So no filteration happens in this case.

test
string
a 

result:

history
    ID   string
    1.1  a b b b c c s d s ....
    1.2  a b b b b c s s d ....
    2.1  a c c s s d b d b ....
    2.2  a c s c s d b d b ....
    3.1  a z z x d b d d f ....
    3.2  a z x z d d f b d ....
    ...

The second character is "c". Here to make sure we are not matching a random "c" from somewhere in "history" I think establishing a rule will be helpful. Something like a match occurs if "a" then "c".

test
    string
    a c

result:

history
    ID   string
    2.1  a c c s s d b d b ....
    2.2  a c s c s d b d b ....

This has already narrowed down the match to history ID 2.1 and 2.2. Frankly we can even stop here as I said before the differences between those two are minute. SO in conclusion as soon as history has been filtered down to just one ID it should output which ID had the best match to the "test" string.

Upvotes: 1

Views: 715

Answers (2)

M.R.
M.R.

Reputation: 36

Building on top of the excellent example given above by AntoniosK:

You could apply some weighting factor for each column. So if column 1 is very important then multiply that with 10.000 and the second column only with 1.000. Then sum the values row-wise and find the highest sum to get the best fitting strings.

(Aka a b c d e f matches a b c x x x better than a b x d e f does)

library(tidyverse)
df = data.frame(ID = c(1.1,1.2,2.1,2.2,3.1,3.2),
                string = c("a b b b c c s d s",
                           "a b b b b c s s d",
                           "a c c s s d b d b",
                           "a c s c s d b d b",
                           "a z z x d b d d f",
                           "a z x z d d f b d"), 
                stringsAsFactors = F)
# string to test
test <-  "a c c c s s"

weights <- c(1000,100,10,10,10,10,10,10,10)

df_answer <- df %>%
  separate_rows(string) %>%
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>% 
  mutate(scores = (string == test) * weights) %>% 
  summarise(scores = sum(scores, na.rm = TRUE)) %>%
  filter(scores == max(scores))

# A tibble: 2 x 2
#     ID scores
#  <dbl>  <dbl>
#1   2.1   1120
#2   2.2   1120

Upvotes: 0

AntoniosK
AntoniosK

Reputation: 16121

Here are two tidyverse solutions that will return the ID value(s) with the maximum number of matches with your test string and the number of matches:

df = data.frame(ID = c(1.1,1.2,2.1,2.2,3.1,3.2),
                string = c("a b b b c c s d s",
                           "a b b b b c s s d",
                           "a c c s s d b d b",
                           "a c s c s d b d b",
                           "a z z x d b d d f",
                           "a z x z d d f b d"), 
                stringsAsFactors = F)

library(tidyverse)

# string to test
test = "a c c c s s"

Option 1 (considering matches at any position)

df %>%
  separate_rows(string) %>%
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%
  na.omit() %>%
  summarise(matches = sum(string == test)) %>%
  filter(matches == max(matches))

# # A tibble: 2 x 2
#      ID matches
#   <dbl>   <int>
# 1   2.1       4
# 2   2.2       4

Option 2 (considering consecutive matches)

df %>%
  separate_rows(string) %>%
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%
  na.omit() %>%
  summarise(matches = sum(cumprod(string == test))) %>%
  filter(matches == max(matches))

# # A tibble: 1 x 2
#        ID matches
#     <dbl>   <dbl>
#   1   2.1       3

Upvotes: 1

Related Questions