Reputation: 21400
I'm working on string distance in multi-word strings, as in this toy data:
df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
I'd like to determine the (dis)similarity of each row compared to the next row on a word-by-word basis. I use this code:
library(dplyr)
library(tidyr)
library(stringdist)
library(stringr) # needed for str_c() below
df %>%
  mutate(col2 = lead(col1, 1),
         id = row_number()) %>%
  pivot_longer(
    # select columns:
    cols = c(col1, col2),
    # determine name of new column:
    names_to = c(".value", "Col_N"),
    # define capture groups (...) for new column:
    names_pattern = "^([a-z]+)([0-9])$") %>%
  # separate each word into its own row:
  separate_rows(col, sep = "\\s") %>%
  # recast into wider format:
  pivot_wider(id_cols = c(id, Col_N),
              names_from = Col_N,
              values_from = col) %>%
  # unnest lists:
  unnest(.) %>%
  # calculate string distance:
  mutate(distance = stringdist(`1`, `2`)) %>%
  group_by(id) %>%
  # reconnect same-string words and distance values:
  summarise(col1 = str_c(unique(`1`), collapse = " "),
            col2 = str_c(unique(`2`), collapse = " "),
            distance = str_c(distance, collapse = ", "))
# A tibble: 5 x 4
id col1 col2 distance
* <int> <chr> <chr> <chr>
1 1 ab ab bc 0, 2
2 2 ab bc yyyy 4, 4
3 3 yyyy yyyy pw hhhh 0, 4, 4
4 4 yyyy pw hhhh wstjz 5, 5, 5
5 5 wstjz NA NA
While the result seems to be okay, there are three problems with it: a) there are a number of warnings, b) the code seems quite convoluted, and c) distance is of type character. So I'm wondering if there's a better way to determine the (dis)similarity of strings word by word?
Upvotes: 1
Views: 199
Reputation: 5887
Setting aside my comments below, a straightforward approach would be this:
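library(data.table)
library(dplyr) # lead()
library(stringr) # str_split()
library(stringdist)
setDT(df)
# split each string into a list-column of words:
df[, col1 := list(str_split(col1, " "))]
df[, col2 := lead(col1, 1)]
# word-by-word distances, kept as a list-column:
df[, distance := lapply(.I, function(x) { stringdist(col1[x][[1]], col2[x][[1]]) })]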
Be careful with any stringdist-like function: on a huge dataset, making all the comparisons is quite expensive. Also keep in mind what you are going to use the distances for. Are you truly interested in the distance itself, or only in pairs with a distance < x? If the latter, you most likely would not consider a compared to axxxxxxxxxxxxxxx a close match, and you can already see that from the difference in string length, which takes far fewer resources to calculate than the actual distance.
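A rough sketch of that length check, assuming an edit-type distance such as stringdist's default "osa" method, where the difference in length is a lower bound on the distance. The helper name dist_within and its max_dist argument are made up for illustration; they are not part of stringdist.

```r
library(stringdist)

# Hypothetical helper: only call stringdist() for pairs that could still be
# within max_dist. Assumes a and b have the same length (no recycling).
dist_within <- function(a, b, max_dist) {
  len_gap <- abs(nchar(a) - nchar(b))   # cheap lower bound on the edit distance
  out <- rep(NA_real_, length(a))       # NA = ruled out by length alone
  candidate <- len_gap <= max_dist
  out[candidate] <- stringdist(a[candidate], b[candidate], method = "osa")
  out
}

res <- dist_within(c("a", "ab", "yyyy"),
                   c("axxxxxxxxxxxxxxx", "bc", "yyyy"),
                   max_dist = 2)
res
# NA 2 0
```

The first pair is skipped without computing a distance at all. Values that are returned can still be larger than max_dist, so the real threshold filter has to be applied afterwards.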
Also, it would be a waste of computation to blindly compute row by row. Let's make a slightly longer sample set:
c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")
Here you would calculate the distance between yyyy and yyyy 3x, which should be done only once (well, actually you should catch those with an "is equal" check first), and likewise 3x for yyyy and hhhh / hhhh and yyyy.
With a small dataset you probably do not have to worry, but with large sets and longer strings... it can become messy / slow pretty fast.
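One way this could look, as a sketch only: compute each distinct word pair once and map the result back. It assumes a symmetric distance (true for the default "osa" method) and that the separator "\x1f" never occurs in the words. The vectors a and b and all other names here are made up for illustration.

```r
library(stringdist)

# Illustrative word pairs, already split out of the rows
a <- c("yyyy", "yyyy", "hhhh", "pw",    "yyyy")
b <- c("yyyy", "hhhh", "yyyy", "wstjz", "yyyy")

# Order each pair so that (yyyy, hhhh) and (hhhh, yyyy) share one key
lo <- pmin(a, b)
hi <- pmax(a, b)
key <- paste(lo, hi, sep = "\x1f")

# Keep one representative per distinct pair
first <- !duplicated(key)
u_lo <- lo[first]
u_hi <- hi[first]
u_key <- key[first]

# Equal strings get 0 without calling stringdist() at all
u_dist <- numeric(length(u_key))
differ <- u_lo != u_hi
u_dist[differ] <- stringdist(u_lo[differ], u_hi[differ])

# Map the distances back to the original pairs
distance <- u_dist[match(key, u_key)]
distance
```

Here stringdist() runs on two pairs instead of five. On a toy set that saves nothing, but the same pattern should pay off once the same words keep recurring in a large dataset.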
Upvotes: 0
Reputation: 611
A solution:
library(dplyr)
library(stringdist)
df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors=FALSE
)
# word-by-word distances between the two strings in one row:
comps = function(a.row){
  paste(stringdist(unlist(strsplit(as.character(a.row[1]), ' ')),
                   unlist(strsplit(as.character(a.row[2]), ' '))),
        collapse = ' ')
}
df %>%
  mutate(col2 = lead(col1, 1)) %>%
  mutate(distance = apply(., 1, comps))
Note the as.character in the strsplit function.
Upvotes: 2
Reputation: 2650
How about something like this:
mydf <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
mydf
library(dplyr)
library(stringdist)
mydf %>%
  mutate(col1_lead = lead(col1)) %>%
  apply(1, function(x){
    stringdist(
      unlist(strsplit(x["col1"], " ")),
      unlist(strsplit(x["col1_lead"], " "))
    )}
  ) %>%
  cbind() %>%
  `colnames<-`("distance") %>%
  cbind(mydf)
Upvotes: 1
Reputation: 6776
Below is my simple, honest idea.
I make list-columns holding the words and calculate the distance row by row with unlist
(because stringdist needs a vector),
and keep the distance as a list-column.
library(stringr) # for str_split(); dplyr and stringdist as loaded in the question
ans <- df %>%
  as_tibble() %>%
  mutate(id = row_number(), # not used below
         col2 = lead(col1, 1),
         sep_col1 = str_split(col1, " "),
         sep_col2 = str_split(col2, " ")) %>% # or str_split(lead(col1, 1), " ")
  rowwise() %>%
  mutate(dist = list(stringdist(unlist(sep_col1), unlist(sep_col2))),
         for_just_look = paste(dist, collapse = ", ")) %>%
  ungroup()
ans
# col1 id col2 sep_col1 sep_col2 dist for_just_look
# <chr> <int> <chr> <list> <list> <list> <chr>
# 1 ab 1 ab bc <chr [1]> <chr [2]> <dbl [2]> 0, 2
# 2 ab bc 2 yyyy <chr [2]> <chr [1]> <dbl [2]> 4, 4
# 3 yyyy 3 yyyy pw hhhh <chr [1]> <chr [3]> <dbl [3]> 0, 4, 4
# 4 yyyy pw hhhh 4 wstjz <chr [3]> <chr [1]> <dbl [3]> 5, 5, 5
# 5 wstjz 5 NA <chr [1]> <chr [1]> <dbl [1]> NA
Upvotes: 0