rick

Reputation: 1115

ruby regex, everything BUT what is in parentheses and brackets

I am trying to write a regex that matches the content of a string that is NOT in parentheses or brackets. The parentheses always contain a year, and the brackets could contain any normal characters, upper and lower case. I was going about it by matching the brackets and parentheses and then wrapping that in [^\regex] to exclude it (is this right?)

here's the string:

s = 'Some words (1999) [THINGS]'

and the regex:

/[^(\(\d{4}\))|\[.*\]]/

but this still matches the characters inside the brackets (see http://rubular.com/r/bbpcnnGgCI)

Everything works up until adding the [^\regex]

for example, this works to get (1999):

>> puts s.match(/\(\d{4}\)/)
(1999)  

and for what's in brackets:

>> puts s.match(/\[.*\]/)
[THINGS]

but put them together using | for "or":

>> puts s.match(/\(\d{4}\)|\[.*\]/)
(1999)

...it just matches the parentheses and its contents.

What am I doing wrong here?

Upvotes: 1

Views: 7348

Answers (4)

Karl Knechtel

Reputation: 61519

(\(\d{4}\))|\[.*\] means "four digits surrounded in parentheses, and also captured in a group; or anything between square brackets".
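As a small Ruby illustration of why the combined pattern in the question prints only (1999): String#match returns just the first match, while String#scan collects every match. This is a minimal sketch, assuming the sample string from the question; the variable names are only for illustration.

```ruby
s = 'Some words (1999) [THINGS]'
pattern = /\(\d{4}\)|\[.*\]/

# match stops at the first place the pattern succeeds
first_match = s.match(pattern)[0]

# scan keeps going and returns every non-overlapping match
every_match = s.scan(pattern)

puts first_match          # (1999)
puts every_match.inspect  # ["(1999)", "[THINGS]"]
```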

[^...] does not mean "anything that isn't matched by ...". [] sets up a character-set, which if it starts with ^ is negated. [^(\(\d{4}\))|\[.*\]] means "a character that is not an open parenthesis or an open parenthesis or a digit or an open brace or a 4 or a close brace or a close parenthesis or a close parenthesis or a pipe or an open square bracket or a period or a star or a close square bracket".
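To see the character-set behaviour concretely, here is a hedged Ruby sketch. It assumes the sample string from the question and uses a de-duplicated version of the same set (the repeated parentheses and the 4, which \d already covers, are dropped so Ruby does not warn about duplicates); it should accept and reject the same characters as the original.

```ruby
s = 'Some words (1999) [THINGS]'

# Each match is ONE character that is not ( ) { } | [ ] . * or a digit.
negated_set = /[^(){}|\[\].*\d]/

kept = s.scan(negated_set).join
puts kept.inspect  # "Some words  THINGS"
```

The letters of THINGS survive because none of them is in the excluded set, which is why the bracket contents still show up.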

You want to match "any text that is not in parentheses or brackets". This is not easily expressed as a regex directly. What you really want to do is split the string using "any parenthesized or bracketed item" as a delimiter.

I don't know the ruby syntax, but in Python this looks like:

import re

pattern = re.compile(r"(?:\[[^\]]*\])|(?:\(\d{4}\))")

pattern.split('Some words (1999) [THINGS]') # ['Some words ', ' ', '']

That gives you the individual pieces, assuming you need them. If you're just going to join them up again, then the "replace the delimiters with empty strings" (i.e. gsub) approach works just fine.
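For reference, a rough Ruby equivalent of the Python snippet might look like the following. It is a sketch rather than a definitive translation, and the variable name delimiter is made up for illustration. Note that Ruby's String#split drops trailing empty strings unless it is given a negative limit.

```ruby
s = 'Some words (1999) [THINGS]'

# Any bracketed item, or a four-digit year in parentheses
delimiter = /\[[^\]]*\]|\(\d{4}\)/

pieces = s.split(delimiter)            # trailing empty string is dropped
all_pieces = s.split(delimiter, -1)    # negative limit keeps it
cleaned = s.gsub(delimiter, '').strip  # the "replace with empty strings" route

puts pieces.inspect      # ["Some words ", " "]
puts all_pieces.inspect  # ["Some words ", " ", ""]
puts cleaned.inspect     # "Some words"
```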

Upvotes: 3

Bohemian

Reputation: 425033

What about looking at this from the opposite direction: Try replacing the pattern \(\d{4}\) with blank "", then you'll have what you want:

s.gsub(/\(\d{4}\)/, "")
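One detail worth checking in Ruby: gsub accepts either a String or a Regexp as its first argument, and a String is matched as literal text, so the pattern has to be a regex literal for \d{4} to mean "four digits". A small sketch, assuming the sample string from the question:

```ruby
s = 'Some words (1999) [THINGS]'

# A String pattern is matched literally, so nothing is found here.
as_string = s.gsub('\(\d{4}\)', '')

# A Regexp pattern removes the parenthesized year.
as_regexp = s.gsub(/\(\d{4}\)/, '')

puts as_string.inspect  # "Some words (1999) [THINGS]"
puts as_regexp.inspect  # "Some words  [THINGS]"
```

This still leaves [THINGS] in place; the bracketed part would need its own alternative in the pattern.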

EDITED: To incorporate syntax correction suggested by @rick (thx @rick!)

Upvotes: 0

David Tuite

Reputation: 22643

Try /\(.+/, which will match everything from the opening ( onwards. If you strip that out, you're left with 'Some words ' (with a trailing space), which should be what you need?

Two points

  1. I may be misunderstanding the question
  2. You need something more complicated if there's any possibility of an ( appearing earlier in the string.

By the way, I find this rather valuable when trying to come up with Regex patterns.

Edit: This pattern should only match stuff in brackets even if there is a stray bracket earlier in the string.

string.gsub(/(\(|\[).+(\)|\])/, '')
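A caveat with the greedy .+ in that pattern: when a string has several bracketed groups with words between them, it can swallow everything from the first opener to the last closer. A hedged alternative is to match each group with a negated character class. The sample string below is invented for illustration.

```ruby
text = 'A (1999) B [X] C'

# Greedy: matches from the first "(" through the last "]"
greedy = text.gsub(/(\(|\[).+(\)|\])/, '')

# Per-group: each (...) or [...] is matched on its own
per_group = text.gsub(/\([^)]*\)|\[[^\]]*\]/, '')

puts greedy.inspect     # "A  C"
puts per_group.inspect  # "A  B  C"
```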

Upvotes: 5

cordsen

Reputation: 1701

If you need something that matches multiple sets of brackets in a string mixed with words, this will work (http://rubular.com/r/rvcO4TyBLq):

((\(\d{4}\))|(\[[^\]]+\]))+
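As a quick check of how this pattern behaves with gsub, here is a sketch on a made-up string with two separate bracketed runs. The trailing + lets back-to-back groups such as (1999)[A] be consumed as a single match.

```ruby
text = 'Intro (1999)[A] middle [B] end'
pattern = /((\(\d{4}\))|(\[[^\]]+\]))+/

# Record each whole match while replacing it with an empty string.
matches = []
stripped = text.gsub(pattern) { |m| matches << m; '' }

puts matches.inspect   # ["(1999)[A]", "[B]"]
puts stripped.inspect  # "Intro  middle  end"
```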

Upvotes: 0
