Vivin Paliath

Reputation: 95598

Splitting a string according to a delimiter when elements in the string can contain the delimiter

I have a string that looks like this:

"#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"

I'd like to get back

[#Text(), #SomeMoreText(), #TextThatContainsDelimiter(#blah), #SomethingElse()]

One way I thought of doing this was to require that the # be escaped as \#, which makes the input string:

"#Text() #SomeMoreText() #TextThatContainsDelimiter(\#blah) #SomethingElse()"

I can then split it using /[^\\]#/ which gives me:

[#Text(), SomeMoreText(), TextThatContainsDelimiter(\#blah), SomethingElse()]

The first element will contain # but I can strip it out. However, is there a cleaner way to do this that does not require escaping the # and that ensures the first element does not contain a #? Basically, I'd like to split on # only if the # is not enclosed in parentheses.

My hunch is that since the # is context-sensitive and regular expressions are only suited to regular languages, this may not be the right tool. If so, would I have to write a grammar for this and roll my own parser/lexer?
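For comparison, the hand-rolled route need not involve a full grammar. Below is a minimal sketch in Python of a scanner that tracks parenthesis depth and only treats # as a token start at depth zero. The function name split_top_level is hypothetical, and the sketch assumes that parentheses are balanced and that anything before the first top-level # can be discarded.

```python
def split_top_level(text, marker="#"):
    """Split text at each marker that is not inside parentheses."""
    tokens = []
    depth = 0      # current parenthesis nesting level
    start = None   # index where the current token began, if any
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        elif ch == marker and depth == 0:
            # A top-level marker closes the previous token and opens a new one.
            if start is not None:
                tokens.append(text[start:i].strip())
            start = i
    if start is not None:
        tokens.append(text[start:].strip())
    return tokens


s = "#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"
print(split_top_level(s))
```

Because the depth counter is explicit, this also copes with nested parentheses and with a # that follows a space inside the parentheses, which a fixed-width regex check cannot see.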

Upvotes: 2

Views: 392

Answers (2)

Alan Moore

Reputation: 75272

From your example, it looks like you want to split on whitespace that's immediately followed by a hash symbol:

/\s+(?=#)/

That leaves the leading # on all the tokens, but you won't need to treat the first token specially. You could also use this:

/(?:^|\s+)#/

That would strip the hash symbols at the cost of generating an empty string as the first token. But some languages provide a way to discard empty leading tokens. Note that JavaScript has always supported lookaheads; lookbehinds were only added in ES2018, so older engines do not have them.
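As a rough illustration, here is how the two patterns behave in Python's re module, using the sample string from the question. The variable names are only for this sketch, and the list comprehension is one way of discarding the empty leading token.

```python
import re

s = "#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"

# Split on whitespace that is immediately followed by '#'. The lookahead
# does not consume the '#', so every token keeps its leading hash.
with_hash = re.split(r"\s+(?=#)", s)

# Consume the '#' as part of the delimiter. The match at the very start of
# the string yields an empty first token, which is filtered out here.
without_hash = [t for t in re.split(r"(?:^|\s+)#", s) if t]

print(with_hash)
print(without_hash)
```

Neither pattern matches the # inside the parentheses, because in the sample input it is preceded by ( rather than by whitespace or the start of the string.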

Upvotes: 2

Joey

Reputation: 354864

Argh! I tend to lose my abilities here. The regex (?<!\()(?=#) works:

PS Home:\> $s -split '(?<!\()(?=#)'

#Text()
#SomeMoreText()
#TextThatContainsDelimiter(#blah)
#SomethingElse()

This combines a negative lookbehind (to make sure there isn't an opening parenthesis preceding the #) and a positive lookahead to look for the #.
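The same pattern can be tried outside PowerShell. This is a sketch in Python, assuming Python 3.7 or later, where re.split accepts patterns that match the empty string. Since the match is zero-width, the pieces keep their trailing spaces and an empty piece appears at the start, so both are cleaned up afterwards.

```python
import re

s = "#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"

# Zero-width split point: not preceded by '(' and followed by '#'.
pieces = re.split(r"(?<!\()(?=#)", s)

# Trim the whitespace left on each piece and drop the empty leading piece.
tokens = [p.strip() for p in pieces if p.strip()]

print(tokens)
```

The lookbehind inspects only the single character before the #, so this relies on the nested # directly following the opening parenthesis, as it does in the sample input.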

Upvotes: 2
