Paulo Freitas

Reputation: 13649

Regex to ignore unwanted matched quotes

I'm trying to split several key-value lines with a regular expression in Python. The file I'm working with has more than 1.2M lines, so I created a smaller one with a few lines covering all the different key-value forms I need to handle:

@=""
@="0"
@="="
@="@"
@="k=\"v\""
@=dword:00000000
@=hex:00
"k"=""
"k"="="
"k"="@"
"k"="k=\"v\""
"k"="v"
"k"=dword:00000000
"k"=hex:00
"k=\"v\""=""
"k=\"v\""="="
"k=\"v\""="@"
"k=\"v\""="k=\"v\""
"k=\"v\""="v"
"k=\"v\""=dword:00000000
"k=\"v\""=hex:00

I'm already doing the job with a fairly simple look-behind/look-ahead regex that works like a charm:

#!/usr/bin/env python
import re
regex = re.compile(r'(?<=@|")=(?=[dh"])')

for line in open('split-test'):
    line = line.strip()
    key, value = regex.split(line, 1)

    if key != '@':
        key = key[1:-1]

    print('{} => {}'.format(key, value))

Output:

@ => ""
@ => "0"
@ => "="
@ => "@"
@ => "k=\"v\""
@ => dword:00000000
@ => hex:00
k => ""
k => "="
k => "@"
k => "k=\"v\""
k => "v"
k => dword:00000000
k => hex:00
k=\"v\" => ""
k=\"v\" => "="
k=\"v\" => "@"
k=\"v\" => "k=\"v\""
k=\"v\" => "v"
k=\"v\" => dword:00000000
k=\"v\" => hex:00

As you can see, the code has to strip the leading and trailing quotes from the key after the split. To be clear, I'm not trying to optimize anything; I just want to learn how to achieve the same result with the regular expression itself.

I've tried many variations of the original code above, and I ended up with a horrible, slow, but working regex:

#!/usr/bin/env python
import re
regex = re.compile(r'(?:(@)|(?:"((?:(?:[^"\\]+)|\\.)*)"))=')

for line in open('split-test'):
    line = line.strip()
    key, value = filter(None, regex.split(line))

    print('{} => {}'.format(key, value))

Here I have to use filter() because the split also returns some empty strings. I'm not a regular expression master, so I'm wondering whether there is a better-written regex that would do this job.
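To illustrate where those empty pieces come from, here is a small sketch in Python 3 syntax. It shows what `regex.split` returns when the pattern contains capture groups, followed by a `match`-based alternative that avoids the filtering step. The helper name `split_line` is made up for this example and is not part of the original code.

```python
import re

# Same pattern as above: group 1 captures a bare @, group 2 a quoted key.
regex = re.compile(r'(?:(@)|(?:"((?:(?:[^"\\]+)|\\.)*)"))=')

# split() returns the text before the match, then every capture group
# (None for a group that did not participate), then the remainder.
print(regex.split('@="0"'))    # ['', '@', None, '"0"']
print(regex.split('"k"="v"'))  # ['', None, 'k', '"v"']


def split_line(line):
    """Hypothetical helper: match once at the start instead of splitting."""
    m = regex.match(line)
    if m is None:
        raise ValueError('not a key-value line: {!r}'.format(line))
    # Exactly one of the two groups participates in a successful match.
    key = m.group(1) if m.group(1) is not None else m.group(2)
    # Everything after the matched "key=" prefix is the value.
    return key, line[m.end():]


print(split_line(r'"k=\"v\""=dword:00000000'))
```

One design note: `filter(None, ...)` would also discard an empty quoted key, whereas reading the groups directly keeps it.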

Upvotes: 2

Views: 311

Answers (3)

Alan Moore

Reputation: 75272

So you want a regex that scrapes off the quotes in the process of matching? Check this out:

r'^(")?((?(1)[^"\\]*(?:\\.[^"\\]*)*|@))"?=([dh"].+$)'

If the first character is a quote, it gets captured in group #1, the (1) condition succeeds, and the YES branch of the conditional consumes everything up to the next unescaped quote (but not the quote itself). If not, the NO branch tries to match @. Either way, the key gets captured in group #2, without enclosing quotes.

The rest of the regex is straightforward: it consumes the trailing quote (if there is one) and the =, and then the rest of the string gets captured in group #3. Note that it could match malformed inputs that start with @" or "". If that's not acceptable, you can add a lookahead to validate the format before the actual matching starts. I didn't bother because the extra clutter would get in the way of explaining the core technique.

^
(")?
(
  (?(1)
    [^"\\]*(?:\\.[^"\\]*)*
    |
    @
  )
)
"?
=
([dh"].+$)
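As a rough sketch of how this pattern might be used (Python 3 syntax; the names `KEY_VALUE` and `parse_line` are invented for illustration), the layout above can be compiled with `re.VERBOSE`, which ignores the whitespace, and the key and value read from groups 2 and 3:

```python
import re

KEY_VALUE = re.compile(r'''
    ^
    (")?                          # group 1: optional opening quote
    (                             # group 2: the key, without quotes
      (?(1)
        [^"\\]*(?:\\.[^"\\]*)*    # quoted key: up to the next unescaped quote
        |
        @                         # unquoted key: the literal @
      )
    )
    "?                            # closing quote, if there is one
    =
    ([dh"].+$)                    # group 3: the value
''', re.VERBOSE)


def parse_line(line):
    """Hypothetical helper: return (key, value), or None if no match."""
    m = KEY_VALUE.match(line)
    if m is None:
        return None
    return m.group(2), m.group(3)


for line in ['@="0"', '"k"=hex:00', r'"k=\"v\""="v"']:
    print('{} => {}'.format(*parse_line(line)))
```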

Upvotes: 1

user557597

Reputation:

I think you're on the right track with your last regex, which tries to account for the quotes. The patterns below use capture groups with a match instead of split.

There are two ways to go.

Assume quotations are imperfect (unbalanced) -

 #  ^((?:"[^"\\]*(?:\\.[^"\\]*)*"|.)*)=((?:"[^"\\]*(?:\\.[^"\\]*)*"|[^=])*)$

 ^
 (                         # (1 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  .
      )*
 )                         # (1 end)
 =
 (                         # (2 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  [^=]
      )*
 )                         # (2 end)
 $

or, assume they are perfect -

 #  ^((?:"[^"\\]*(?:\\.[^"\\]*)*"|[^"])*)=((?:"[^"\\]*(?:\\.[^"\\]*)*"|[^="])*)$

 ^
 (                         # (1 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  [^"]
      )*
 )                         # (1 end)
 =
 (                         # (2 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  [^="]
      )*
 )                         # (2 end)
 $
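A minimal sketch of using the second (balanced) pattern from Python 3; `QUOTED`, `PAIR` and `parse_line` are made-up names. Note that group 1 keeps the key's surrounding quotes, so they would still need to be stripped afterwards if that is wanted.

```python
import re

# One double-quoted string that may contain backslash escapes.
QUOTED = r'"[^"\\]*(?:\\.[^"\\]*)*"'

# Balanced-quotes variant: key and value are runs of quoted strings and
# characters that are not quotes (and, for the value, not = either).
PAIR = re.compile(r'^((?:%s|[^"])*)=((?:%s|[^="])*)$' % (QUOTED, QUOTED))


def parse_line(line):
    """Hypothetical helper: return (key, value), or None if no match."""
    m = PAIR.match(line)
    return m.groups() if m else None


print(parse_line('@=dword:00000000'))  # ('@', 'dword:00000000')
print(parse_line(r'"k=\"v\""="="'))    # key is returned with its quotes
```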

Upvotes: 1

Tryph

Reputation: 6219

Maybe this will do the trick:

#!/usr/bin/env python
import re
string = r"""@=""
@="0"
@="="
@="@"
@="k=\"v\""
@=dword:00000000
@=hex:00
"k"=""
"k"="="
"k"="@"
"k"="k=\"v\""
"k"="v"
"k"=dword:00000000
"k"=hex:00
"k=\"v\""=""
"k=\"v\""="="
"k=\"v\""="@"
"k=\"v\""="k=\"v\""
"k=\"v\""="v"
"k=\"v\""=dword:00000000
"k=\"v\""=hex:00
"""
regex = re.compile(r'("?)(.*)\1=(["hd].+)')

results = regex.findall(string)
for _, key, value in results:
    print('{} => {}'.format(key, value))

It gives the result below:

@ => ""
@ => "0"
@ => "="
@ => "@"
@ => "k=\"v\""
@ => dword:00000000
@ => hex:00
k => ""
k => "="
k => "@"
k => "k=\"v\""
k => "v"
k => dword:00000000
k => hex:00
k=\"v\" => ""
k=\"v\" => "="
k=\"v\" => "@"
k=\"v\" => "k=\"v\""
k=\"v\" => "v"
k=\"v\" => dword:00000000
k=\"v\" => hex:00

Upvotes: 3
