Reputation: 401
I have 2 datasets containing similar string vectors (product titles). The only difference between the strings in both datasets is absence/presence of special characters.
Now, my problem is to match the corresponding string vectors and return the non-matching element(s) (which should be special character(s) in each case). There can be many non-matching special characters in a single string.
For example, I have two texts:
Text 1: Analog Science Fiction and Fact February 1995
Text 2: Analog Science Fiction and Fact, February 1995
Is there an R function to return the non-matching element(s) only?
This is how I approached the problem:
S.vector <- strsplit(Acceptdata['Text.1'][1,],' ')
S.vector
# [[1]]
# [1] "Analog" "Science" "Fiction" "and" "Fact" "February" "1995"
F.vector <- strsplit(Acceptdata['Text.2'][1,],' ')
F.vector
# [[1]]
# [1] "Analog" "Science" "Fiction" "and" "Fact," "February" "1995"
l.S.vector <- tolower(S.vector)
l.F.vector <- tolower(F.vector)
grep("l.S.vector",l.F.vector,invert=T,value=T)
# [1] "c(\"analog\", \"science\", \"fiction\", \"and\", \"fact,\", \"february\", \"1995\")"
Any help is greatly appreciated.
When I try to run the algorithm on the entire dataset (~500 vectors), it throws the error `is.character(a) is not TRUE`.
The procedure I followed:
common <- function(a,b) {
for (i in seq_along(a))
for (j in seq_along(b))
i2 <- strsplit(tolower(i),'')
j2 <- strsplit(tolower(j),'')
if(length(i2) < length(j2)) {
i2[(length(i2)+1):length(j2)] <- ' '
} else if(length(i2) > length(j2)) {
b2[(length(b2)+1):length(a2)] <- ' '
}
LCS(i2,j2)
}
z <- common(a,b)
Error: is.character(a) is not TRUE
Any idea where I went wrong?
Upvotes: 1
Views: 687
Reputation: 44565
I'm not totally clear on your intended output, but I think this will help you get there. It uses the LCS (longest common subsequence) function from the qualV package.
library("qualV")
common <- function(a,b) {
  a2 <- strsplit(a,'')[[1]]
  b2 <- strsplit(b,'')[[1]]
  if(length(a2) < length(b2)) {
    a2[(length(a2)+1):length(b2)] <- ' '
  } else if(length(a2) > length(b2)) {
    b2[(length(b2)+1):length(a2)] <- ' '
  }
  LCS(a2,b2)
}
Here's an example using your two strings:
a <- 'Analog Science Fiction and Fact February 1995'
b <- 'Analog Science Fiction and Fact, February 1995'
z <- common(a,b)
paste0(z$LCS, collapse = '') # common string
# [1] "Analog Science Fiction and Fact February 1995"
z$b[which(!seq(1,max(z$vb)) %in% z$vb)] # non-matching elements in `b`
# [1] ","
z$a[which(!seq(1,max(z$va)) %in% z$va)] # non-matching elements in `a`
# character(0)
Here's an example using two strings that have more differences:
a <- 'Analog! SCIENCE Fiction and Fact Feb. 1995'
b <- 'Analog Science Fiction & Fact (February 1995)'
z <- common(a,b)
paste0(z$LCS, collapse = '') # common string
# [1] "Analog S Fiction Fact Feb 1995"
z$b[which(!seq(1,max(z$vb)) %in% z$vb)] # non-matching elements in `b`
# [1] "c" "i" "e" "n" "c" "e" "&" "(" "r" "u" "a" "r" "y"
z$a[which(!seq(1,max(z$va)) %in% z$va)] # non-matching elements in `a`
# [1] "!" "C" "I" "E" "N" "C" "E" "a" "n" "d" "."
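If you need this for every row of the dataset, note that the `is.character(a) is not TRUE` error most likely comes from passing lists to `LCS`: `strsplit()` returns a list, and the loops in your version iterate over indices rather than over the strings themselves. As a rough alternative sketch (this swaps `qualV::LCS` for base R's `adist()`, and `diff_chars` is a made-up helper name, not something from a package), you could read the non-matching characters off the edit transcript and run it pairwise with `Map()`:

```r
# Hypothetical helper (not from qualV): characters found in only one of two
# strings, read from the edit transcript that utils::adist() returns.
diff_chars <- function(a, b, ignore_case = FALSE) {
  stopifnot(is.character(a), length(a) == 1L,
            is.character(b), length(b) == 1L)
  d <- utils::adist(a, b, counts = TRUE, ignore.case = ignore_case)
  # "trafos" describes how `a` is turned into `b`:
  # M = match, S = substitution, I = insertion, D = deletion
  ops <- strsplit(attr(d, "trafos")[1, 1], "")[[1]]
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  # Position reached in each string after every alignment step
  a_pos <- cumsum(ops != "I")
  b_pos <- cumsum(ops != "D")
  list(only_in_a = a_chars[a_pos[ops %in% c("D", "S")]],
       only_in_b = b_chars[b_pos[ops %in% c("I", "S")]])
}

# Made-up sample data standing in for the two title columns
text1 <- c("Analog Science Fiction and Fact February 1995", "Weird Tales")
text2 <- c("Analog Science Fiction and Fact, February 1995", "Weird Tales!")

# One result per pair of titles
res <- Map(diff_chars, text1, text2)
res[[1]]$only_in_b
# [1] ","
res[[2]]$only_in_b
# [1] "!"
```

For the data frame in the question this would presumably be `Map(diff_chars, as.character(Acceptdata$Text.1), as.character(Acceptdata$Text.2))`, assuming the columns are named `Text.1` and `Text.2` as in your snippet. Set `ignore_case = TRUE` if case differences should not count as mismatches.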
Upvotes: 1