Reputation: 1719
I have 3 words: x, y, and z, from which two compound words can be built: x-y and y-z.
In naturally occurring text, x, y, and z can follow each other. In the first case, I have:
text="x-y z"
And I want to detect: "x-y" but not "y z". If I do:
v=c("x-y","y z")
vv=paste("\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)
I get c(TRUE,TRUE). In other words, grepl does not capture the fact that y is already linked to x via the intra-word dash, and that therefore, "y z" is not actually there in the text. So I use a lookbehind after adding whitespace at the beginning of the text:
text=paste("",text,sep=" ")
vv=paste("(?<= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)
This time, I get what I want: c(TRUE, FALSE). Now, in the second case, I have:
text="x y-z"
and I want to detect "y-z" but not "x y". Adopting a symmetrical approach with a lookahead this time, I tried:
text=paste(text,"",sep=" ")
v=c("x y","y-z")
vv=paste("(?= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)
But this time I get c(FALSE,FALSE) instead of c(FALSE,TRUE) as I was expecting. The FALSE in first position is expected (the lookahead detected the presence of the intra-word dash after y and prevented matching with "x y"). But I really do not understand what is preventing the matching with "y-z".
Thanks a lot in advance for your help,
Upvotes: 1
Views: 183
Reputation: 3473
I think this matches the description in your comment of what you want to accomplish.
spaceInvader <- function(a, b, text) {
  # look ahead of `a` to see if there is a space
  hasa <- grepl(paste0(a, '(?= )'), text, perl = TRUE)
  # look behind `b` to see if there is a space
  hasb <- grepl(paste0('(?<= )', b), text, perl = TRUE)
  result <- c(hasa, hasb)
  names(result) <- c(a, b)
  cat('In: "', text, '"\n', sep = '')
  return(result)
}
spaceInvader('x-y', 'y z', 'x-y z')
# In: "x-y z"
# x-y y z
# TRUE FALSE
spaceInvader('x y', 'y-z', 'x y-z')
# In: "x y-z"
# x y y-z
# FALSE TRUE
spaceInvader('x-y', 'y z', 'x y-z')
# In: "x y-z"
# x-y y z
# FALSE FALSE
spaceInvader('x y', 'y-z', 'x-y z')
# In: "x-y z"
# x y y-z
# FALSE FALSE
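As for why the lookahead attempt in the question returned FALSE for "y-z": a lookahead is zero-width, so putting `(?= )` in front of the pattern asks for the character at the match position to be a space and, at the same position, the `y` that starts "y-z". Both cannot hold at once. A minimal sketch of the difference, using the padded text from the question (the variable names `in_front` and `behind` are only for illustration):

```r
# Text from the second case, with the trailing space added in the question
text <- "x y-z "

# Lookahead in front of the pattern: the next character would have to be
# both a space and the "y" of "y-z", so this can never match
in_front <- grepl("(?= )\\by-z\\b", text, perl = TRUE)

# Lookahead after the pattern: "y-z" must be followed by a space
behind <- grepl("\\by-z\\b(?= )", text, perl = TRUE)

# Same placement for "x y": it is followed by "-", not a space, so no match
xy <- grepl("\\bx y\\b(?= )", text, perl = TRUE)

print(c(in_front = in_front, behind = behind, xy = xy))
# in_front   behind       xy
#    FALSE     TRUE    FALSE
```

That is why the function above puts the lookahead after `a` instead of before it.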
Is this a problem?
spaceInvader('x-y', 'y-z', 'x-y-z')
# In: "x-y-z"
# x-y y-z
# FALSE FALSE
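If that result is acceptable, one alternative that avoids padding the text with spaces is to treat the dash as a word character on both sides of the match. This is only a sketch under that assumption, not the approach used above, and `detect_compound` is a hypothetical helper name:

```r
# Hypothetical helper: a match counts only if it is not glued to a
# neighbouring word character or dash on either side
detect_compound <- function(patterns, text) {
  # (?<![\\w-]) : not preceded by a word character or a dash
  # (?![\\w-])  : not followed by a word character or a dash
  regexes <- paste0("(?<![\\w-])", patterns, "(?![\\w-])")
  vapply(regexes, grepl, logical(1), x = text, perl = TRUE, USE.NAMES = FALSE)
}

detect_compound(c("x-y", "y z"), "x-y z")
# [1]  TRUE FALSE
detect_compound(c("x y", "y-z"), "x y-z")
# [1] FALSE  TRUE
detect_compound(c("x-y", "y-z"), "x-y-z")
# [1] FALSE FALSE
```

Because the negative lookarounds also succeed at the start and end of the string, no whitespace needs to be added to the text first. The patterns are pasted in as-is, so any regex metacharacters in them would need escaping.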
Upvotes: 1