Antoine

Reputation: 1719

grepl in R: matching impeded by intra-word dashes

I have 3 words: x, y, and z, from which two compound words can be built: x-y, and y-z.

In naturally occurring text, x, y, and z can follow each other. In the first case, I have:

text="x-y z"

And I want to detect: "x-y" but not "y z". If I do:

v=c("x-y","y z")
vv=paste("\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

I get c(TRUE,TRUE). In other words, grepl does not capture the fact that y is already linked to x via the intra-word dash, and that therefore "y z" is not actually present in the text. So I prepend a space to the text and use a lookbehind:

text=paste("",text,sep=" ")
vv=paste("(?<= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

This time, I get what I want: c(TRUE,FALSE). Now, in the second case, I have:

text="x y-z"

and I want to detect "y-z" but not "x y". Adopting a symmetrical approach with a lookahead this time, I tried:

text=paste(text,"",sep=" ")
v=c("x y","y-z")
vv=paste("(?= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

But this time I get c(FALSE,FALSE) instead of c(FALSE,TRUE) as I was expecting. The FALSE in first position is expected (the lookahead detected the presence of the intra-word dash after y and prevented matching with "x y"). But I really do not understand what is preventing the matching with "y-z".

Thanks a lot in advance for your help.

Upvotes: 1

Views: 183

Answers (1)

Eric

Reputation: 3473

I think this matches the description in your comment of what you want to accomplish.

spaceInvader <- function(a, b, text) {
  # look ahead of `a` to see if there is a space
  hasa <- grepl(paste0(a, '(?= )'), text, perl = TRUE)
  # look behind `b` to see if there is a space 
  hasb <- grepl(paste0('(?<= )', b), text, perl = TRUE)

  result <- c(hasa, hasb)
  names(result) <- c(a, b)
  cat('In: "', text, '"\n', sep = '')
  return(result)
}

spaceInvader('x-y', 'y z', 'x-y z')
# In: "x-y z"
#   x-y   y z 
#  TRUE FALSE 
spaceInvader('x y', 'y-z', 'x y-z')
# In: "x y-z"
#   x y   y-z 
# FALSE  TRUE 
spaceInvader('x-y', 'y z', 'x y-z')
# In: "x y-z"
#   x-y   y z 
# FALSE FALSE 
spaceInvader('x y', 'y-z', 'x-y z')
# In: "x-y z"
#   x y   y-z 
# FALSE FALSE 
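
As for why the lookahead attempt in the question returned FALSE for "y-z": a lookahead placed at the start of the pattern is checked at the position where the match begins, so `(?= )\\by-z` asks for the next character to be both a space and "y". The snippet below is a minimal sketch of that point, assuming the text is padded with a trailing space as in the question. The variable names are only illustrative.

```r
# Text padded with a trailing space, as in the question.
text <- "x y-z "

# Lookahead placed first: the same position must be followed by a space
# (the lookahead) and by "y" (the literal), so no position can satisfy both.
leading <- grepl("(?= )\\by-z\\b", text, perl = TRUE)

# Lookahead placed last: it is checked only after "y-z" has been matched.
trailing <- grepl("\\by-z\\b(?= )", text, perl = TRUE)

# The same trailing lookahead rejects "x y", because "y" is followed by "-".
trailing_xy <- grepl("\\bx y\\b(?= )", text, perl = TRUE)

print(c(leading = leading, trailing = trailing, trailing_xy = trailing_xy))
```

So the first FALSE in the question was not the lookahead noticing the dash; with the lookahead in front, the pattern cannot match anything.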

Note that neither compound is detected when all three words are joined by dashes. Is this a problem?

spaceInvader('x-y', 'y-z', 'x-y-z')
# In: "x-y-z"
#   x-y   y-z 
# FALSE FALSE
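
If padding the text with spaces is a nuisance, one possible alternative is to use negative lookarounds that treat the dash as part of a word. This is only a sketch: `detect_terms` is a hypothetical helper, it assumes `text` is a single string, and it assumes the terms contain no regex metacharacters.

```r
# Hypothetical helper: TRUE for each term that occurs in `text` without being
# attached to a neighbouring word character or dash.
detect_terms <- function(terms, text) {
  # (?<![\\w-]) : not preceded by a word character or a dash
  # (?![\\w-])  : not followed by a word character or a dash
  # Both assertions also succeed at the start and end of the string,
  # so no padding with spaces is needed.
  patterns <- paste0("(?<![\\w-])", terms, "(?![\\w-])")
  hits <- vapply(patterns, grepl, logical(1), x = text, perl = TRUE)
  names(hits) <- terms
  hits
}

first  <- detect_terms(c("x-y", "y z"), "x-y z")
second <- detect_terms(c("x y", "y-z"), "x y-z")
third  <- detect_terms(c("x-y", "y-z"), "x-y-z")
print(first)
print(second)
print(third)
```

Unlike a lookaround on a literal space, this still accepts a term followed by punctuation such as a comma or a full stop. Like `spaceInvader`, it reports neither compound for "x-y-z".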

Upvotes: 1
