Stefan

Reputation: 9599

Splitting a complex string with Regular Expressions

How do I, using a regular expression, split this string:

string = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"

into this array:

string.split( regexp ) =>

[ "a[a=d b&c[e[100&2=34]]]", "e[cheese=blue and white]", "x[a=a b]" ]

The basic rule is that the string should be split at whitespace ( \s ), unless the whitespace is inside brackets ( [ ] ).

Upvotes: 1

Views: 357

Answers (4)

Dolphin

Reputation: 4772

If the rule is this simple, I would suggest just doing it manually. Step through each character and keep track of your nesting level by increasing by 1 for each [ and decreasing by 1 for each ]. If you reach a space with nesting == 0 then split.

Edit: I should also mention that some languages have other pattern-matching facilities that natively support this sort of thing. For example, in Lua you can use '%b[]' to match balanced, nested brackets. (Of course, Lua doesn't have a built-in split function.)
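
For comparison, Ruby's regex engine can match balanced brackets too, through a recursive subexpression call ( \g<name> ). This is only a minimal sketch, assuming the brackets in the input are balanced; the names TOKEN and split_outside_brackets are made up for illustration:

```ruby
# A token is a run of non-space, non-bracket characters and balanced [...] groups.
# \g<group> re-enters the named group, which is how nested brackets are matched.
TOKEN = /(?<token>(?:[^\s\[\]]|(?<group>\[(?:[^\[\]]|\g<group>)*\]))+)/

# Hypothetical helper: returns the top-level tokens of str.
def split_outside_brackets(str)
  # With named groups, scan yields [token, group] pairs; keep only the token.
  str.scan(TOKEN).map(&:first)
end

p split_outside_brackets("a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]")
```

Note that this collects the tokens rather than splitting on the separators, so unbalanced input would need separate handling.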

Upvotes: 4

Mike Tunnicliffe

Reputation: 10772

Another option is a looping approach that deconstructs the nested brackets one level at a time; otherwise it's hard(TM) to ensure that a single regexp will work as expected.

Here's an example in Ruby:

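str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
left = str.dup
tokn = 0
toks = []
# Deconstruct: repeatedly replace an innermost [...] group with a {n} placeholder
loop do
  left.sub!(/\[[^\]\[]*\]/, "{#{tokn}}")
  break if $~.nil?
  toks[tokn] = $&
  tokn += 1
end
# No brackets remain, so any whitespace left is a real separator
left = left.split(/\s+/)
# Reconstruct: put the saved groups back, most recently removed first
(toks.size - 1).downto(0) do |n|
  left.each { |piece| piece.sub!("{#{n}}", toks[n]) }
end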

The above uses {n} placeholders (where n is an integer) during the deconstruction, so input that already contains text of that form would break the reconstruction. It should still illustrate the approach, though.

Writing code that does the split by iterating through the characters is simpler and safer, though.

Example in Ruby:

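str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
toks = []
level = st = en = 0
# each_char yields one-character strings, so they can be compared to '[' directly
str.each_char do |c|
  en += 1
  level += 1 if c == '['
  level -= 1 if c == ']'
  # A space outside all brackets ends the current token
  if level == 0 && c == ' '
    toks.push(str[st, en - 1 - st])
    st = en
  end
end
# Push whatever remains after the last separator
toks.push(str[st, en - st]) if st != en
p toks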

Upvotes: 0

Thomas Cowart

Reputation:

Could you split on "(?<=\])\s(?=[a-z]\[)"? That is, a space preceded by a ] and followed by a letter and a [. This assumes you never have a nested string inside brackets like "a[b=d[x=y b] g[w=v b]]".
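
A quick sketch of that idea in Ruby, with the brackets in the pattern escaped. The second string shows the limitation; the constant and variable names are only for illustration:

```ruby
# Split on a space that follows "]" and precedes a lowercase letter plus "[".
SEPARATOR = /(?<=\])\s(?=[a-z]\[)/

flat = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
flat_parts = flat.split(SEPARATOR)
p flat_parts

# A sibling group nested inside brackets also fits the pattern,
# so this input is split in the wrong place.
nested = "a[b=d[x=y b] g[w=v b]]"
nested_parts = nested.split(SEPARATOR)
p nested_parts
```

The first call gives the three expected pieces; the second cuts the single nested item in two.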

Upvotes: 0

Aaron Digulla

Reputation: 328774

You can't; regular expressions are based on state machines, which don't have a "stack" for remembering the number of nesting levels.

But maybe you can use a trick: Try to convert the string into a valid JSON string. Then you can use eval() to parse it into a JavaScript object.

Upvotes: 5
