Tyler Oliver

Reputation: 1

pegjs regular expression match words up until a word from a collection of words is found

I am using the PEG.js parser generator for a project, and I am having difficulty writing a grammar rule that matches all words up to, but not including, any word from a fixed collection. For example, in the string "the door is yellow" I want to match all words up to "is" and then have the parser continue parsing from the word "is". The words I want the parser to break on are "is", "has", and "of".

My current grammar rule is as follows:

subject "sub" = 
s:[a-zA-Z ]+ { return s.join("").trim()}

How can I create a lookahead that stops the parser from including the words in my collection? My attempt so far:

(!of|is|has)

Upvotes: 0

Views: 837

Answers (2)

Joe Hildebrand

Reputation: 10414

I know this question was asked 5 years ago, but I'm just running through cleaning up unanswered questions in the [pegjs] tag.

This seems to work, and you just need to replace postfix with your further processing rule.

subject "sub" = prefix:prefix breakWord:breakWord postfix:postfix "\n"? {
  return { prefix, breakWord, postfix }
}

prefix = $(!breakWord .)* { return text().trim() }
postfix = [^\n]* { return text().trim() }

breakWord
  = "is"
  / "has"
  / "of"

which generates this with an input of "the door is yellow":

{ prefix: "the door", breakWord: "is", postfix: "yellow" }

Note a couple of things:

  • The form (!breakWord .) is a little slow: for each character in the prefix, it looks ahead to make sure the input at that position doesn't begin with any of the alternatives in the breakWord rule.
  • If you have break words that start with a common set of characters (e.g. "is" and "isn't"), make sure the longer word comes first in the breakWord rule, because PEG alternatives are tried in order and the first match wins.
  • The current postfix rule assumes that a newline might terminate the input.
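
To make the first two notes concrete, here is a rough plain-JavaScript sketch of what the (!breakWord .)* scan does at each position. It is not what PEG.js generates; splitAtBreakWord and BREAK_WORDS are hypothetical names made up for illustration, and the sketch ignores the newline handling in the postfix rule.

```javascript
// Hypothetical illustration only, not PEG.js output.
// Alternatives are tried in array order, like a PEG ordered choice,
// so a longer word sharing a prefix with a shorter one should come first.
const BREAK_WORDS = ["is", "has", "of"];

function splitAtBreakWord(input) {
  for (let i = 0; i < input.length; i++) {
    // The "!breakWord" check: does the input at position i start with a break word?
    const hit = BREAK_WORDS.find((word) => input.startsWith(word, i));
    if (hit !== undefined) {
      return {
        prefix: input.slice(0, i).trim(),
        breakWord: hit,
        postfix: input.slice(i + hit.length).trim(),
      };
    }
    // Otherwise the "." consumes one character and the scan moves on.
  }
  return null; // no break word anywhere, so the rule would fail
}

console.log(splitAtBreakWord("the door is yellow"));
```

Every position in the prefix is tested against every break word, which is where the cost comes from. The sketch also shows that the check is purely textual: an input such as "this door" splits inside "this" (prefix "th", break word "is"), so if that matters you would probably want the break words to require a word boundary.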

Upvotes: 1

RedLaser

Reputation: 680

This will work:

.+(?=\s+(of|is|has))

It matches one or more of any character (except line breaks) up to a point that is followed by white space and then 'of', 'is', or 'has' (checked via a positive lookahead). Because .+ is greedy, it stops before the last such occurrence on the line, not the first.
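
A quick way to check how this pattern behaves is to run it as an ordinary JavaScript regular expression. Note that this is plain JavaScript rather than a PEG.js grammar, since PEG.js expresses lookahead with the & and ! predicates instead of (?=...). The sample sentence and the lazy variant with \b are additions for comparison, not part of the answer above.

```javascript
// Sample input with two break words, chosen to show where each pattern stops.
const input = "the door is yellow and the roof is red";

// The pattern from the answer: greedy .+ backs off only as far as it has to.
const greedy = /.+(?=\s+(of|is|has))/;

// Assumed variant: lazy .+? stops at the first break word, and \b avoids
// matching words that merely start with one (e.g. "island").
const lazy = /.+?(?=\s+(?:of|is|has)\b)/;

const greedyMatch = input.match(greedy)[0];
const lazyMatch = input.match(lazy)[0];

console.log(greedyMatch); // "the door is yellow and the roof"
console.log(lazyMatch);   // "the door"
```

For the single-break-word input from the question, "the door is yellow", both patterns return "the door"; the difference only appears when a break word occurs more than once.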

Upvotes: -1
