Reputation: 14305
I'm trying to tokenize a string like "spam bar ds<hai bye>sd baz eggs"
into the list ['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs'],
i.e. like str.split(), but preserving whitespace inside < ... >.
My solution was to use re.split with the pattern (\S*<.*?>\S*)|\s+.
However, I get the following:
>>> re.split('(\S*<.*?>\S*)|\s+', "spam bar ds<hai bye>sd baz eggs")
['spam', None, 'bar', None, '', 'ds<hai bye>sd', '', None, 'baz', None, 'eggs']
I'm not sure where those Nones and empty strings are coming from. Of course, I can filter them out with a list comprehension [s for s in result if s], but I'm not comfortable doing that before I know the reason.
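For completeness, the filter I mentioned does produce the list I want (shown only to illustrate the workaround):
>>> result = re.split('(\S*<.*?>\S*)|\s+', "spam bar ds<hai bye>sd baz eggs")
>>> [s for s in result if s]
['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs']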
So, (1) why those Nones and empty strings, and (2) could it be done better?
Upvotes: 2
Views: 359
Reputation: 27585
I got this regex:
ss = "spam bar ds<hai bye>sd baz eggs ZQ<boo <abv> foo>WX "
reg = re.compile('(?:'
                     '\S*?'
                     '<'
                     '[^<>]*?'
                     '(?:<[^<>]*>[^<>]*)*'
                     '[^<>]*?'
                     '>'
                 ')?'
                 '\S+')
print reg.findall(ss)
result
['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs',
'ZQ<boo <abv> foo>WX']
A new regex, more accurate, after Cartroo's comment:
import re
pat = ('(?<!\S)'          # absence of non-whitespace before
       '(?:'
           '[^\s<>]+'
           '|'            # OR
           '(?:[^\s<>]*)'
           '(?:'
               '<'
               '[^<>]*?'
               '(?:<[^<>]*?>[^<>]*)*'
               '[^<>]*?'
               '>'
           ')'
           '(?:[^\s<>]*)'
       ')'
       '(?!\S)'           # absence of non-whitespace after
       )
reg = re.compile(pat)
ss = ("spam i>j bar ds<hai bye>sd baz eggs Z<boo <abv>"
" foo>W ttt <two<;>*<:> three> ")
print '%s\n' % ss
print reg.findall(ss)
ss = "a<b<E1>c>d <b<E2>c>d <b<E3>c> a<<E4>c>d <<E5>>d
<<E6>> <<>>"
print '\n\n%s\n' % ss
print reg.findall(ss)
result
spam i>j bar ds<hai bye>sd baz eggs Z<boo <abv> foo>W
ttt <two<;>*<:> three>
['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs',
'Z<boo <abv> foo>W', 'ttt', '<two<;>*<:> three>']
a<b<E1>c>d <b<E2>c>d <b<E3>c> a<<E4>c>d <<E5>>d <<E6>> <<>>
['a<b<E1>c>d', '<b<E2>c>d', '<b<E3>c>', 'a<<E4>c>d', '<<E5>>d',
'<<E6>>', '<<>>']
The above strings were well formed and the results are consistent.
On text whose brackets are not well formed, it may give undesired results:
ss = """A<B<C>D
E<F<G>H
I<J>K>
L<<M>N
O<P>>Q
R<<S> T<<>"""
print '\n\n%s\n' % ss
print reg.findall(ss)
result
A<B<C>D
E<F<G>H
I<J>K>
L<<M>N
O<P>>Q
R<<S> T<<>
['E<F<G>H \nI<J>K>', 'L<<M>N\n O<P>>Q']
That's because of the star at the end of '(?:<[^<>]*?>[^<>]*)*'. This behavior can be turned off by removing the star. It is also what makes it difficult to use regexes for analyzing such "convoluted" texts, as Cartroo called them.
When I said that the results 'E<F<G>H \nI<J>K>' and 'L<<M>N\n O<P>>Q' are undesired, I didn't mean that the matched portions don't respect the regex's pattern (how could they not, since I crafted it?); the matched portions are indeed well formed:
the two portions <G> and <J> are enclosed between one outer pair of brackets: < <G> <J> >
the two portions <M> and <P> are enclosed between one outer pair of brackets: < <M> <P> >
In fact, there was an unstated assumption that each matching portion should extend over only one line. But as soon as an unstated assumption is made explicit, a possible solution emerges.
If matching portions spanning several lines are not desired, it's easy to tell the regex not to match them, contrary to what I wrote. It suffices to add the character \n at a few places in the regex's pattern.
In other words, a matching portion must not cross a \n character, so that character acts as a separator between matching portions. In the same way, any other character can be chosen as a separator between matching portions on the same line, for example # in the following code.
Regexes can't cook or fetch the kids from school, but they are extremely powerful. Saying that the behavior of a regex on malformed text is an issue falls short: one must add that it's an issue with the text, not with the regex. A regex does what it is ordered to do: it eats any text that is given to it, voraciously, without verifying its conformity in any way. That is not a deliberate behavior on its part, so it isn't responsible when it is fed an indigestible text. Saying that the behavior of a regex on malformed text is an issue sounds as if someone reproached a kid for sometimes being fed whisky and peppered food.
It's the responsibility of the coder to ensure that the text passed to a regex is well formed, in the same way that a coder adds verification snippets to ensure that inputs are integers so the program runs correctly.
This point is different from the misuse of regexes when one tries to parse marked-up text such as XML. Regexes are indeed unable to parse such text, because it's impossible to craft a regex that will react correctly to malformed marked-up text. It's also the coder's responsibility not to try to do that.
That doesn't mean regexes must never be employed to analyze marked-up text, provided the text has been validated.
Anyway, even a parser will not extract data from text that is too badly malformed.
I mean that we must distinguish:
the nature of the text passed to a regex (malformed / well formed)
the nature of the aim pursued when using a regex (parsing / analyzing)
import re
ss = """
A<:<11>:<12>:>
fgh
A<#:<33>:<34>:>
A#<:<55>:<56>:>
A<:<77>:<78> i<j>
A<B<C>D #
E<F<G>H #
I<J>K>
L<<M>N
O<P>>Q #
R<<S> T<<>"""
print '%s\n' % ss
pat = ('(?<!\S)'          # absence of non-whitespace before
       '(?:[^\s<>]*)'
       '(?:<'
           '[^<>]*?'
           '(?:<[^<>]*?>[^<>]*)*'
       '>)'
       '(?:[^\s<>]*)'
       '(?!\S)'           # absence of non-whitespace after
       )
reg = re.compile(pat)
print '------------------------------'
print '\n'.join(map(repr,reg.findall(ss)))
pat = ('(?<!\S)'          # absence of non-whitespace before
       '(?:[^\s<>]*)'
       '(?:<'
           '[^<>\n]*?'
           '(?:<[^<>\n]*?>[^<>\n]*)*'
       '>)'
       '(?:[^\s<>]*)'
       '(?!\S)'           # absence of non-whitespace after
       )
reg = re.compile(pat)
print '\n----------- with \\n -------------'
print '\n'.join(map(repr,reg.findall(ss)))
pat = ('(?<!\S)'          # absence of non-whitespace before
       '(?:[^\s<>]*)'
       '(?:<'
           '[^<>#]*?'
           '(?:<[^<>#]*?>[^<>#]*)*'
       '>)'
       '(?:[^\s<>]*)'
       '(?!\S)'           # absence of non-whitespace after
       )
reg = re.compile(pat)
print '\n------------- with # -----------'
print '\n'.join(map(repr,reg.findall(ss)))
pat = ('(?<!\S)'          # absence of non-whitespace before
       '(?:[^\s<>#]*)'
       '(?:<'
           '[^<>#]*?'
           '(?:<[^<>#]*?>[^<>#]*)*'
       '>)'
       '(?:[^\s<>]*)'
       '(?!\S)'           # absence of non-whitespace after
       )
reg = re.compile(pat)
print '\n------ with ^# everywhere -------'
print '\n'.join(map(repr,reg.findall(ss)))
result
A<:<11>:<12>:>
fgh
A<#:<33>:<34>:>
A#<:<55>:<56>:>
A<:<77>:<78> i<j>
A<B<C>D #
E<F<G>H #
I<J>K>
L<<M>N
O<P>>Q #
R<<S> T<<>
------------------------------
'A<:<11>:<12>:>'
'A<#:<33>:<34>:>'
'A#<:<55>:<56>:>'
'i<j>'
'E<F<G>H #\n I<J>K>'
'L<<M>N \n O<P>>Q'
----------- with \n -------------
'A<:<11>:<12>:>'
'A<#:<33>:<34>:>'
'A#<:<55>:<56>:>'
'i<j>'
------------- with # -----------
'A<:<11>:<12>:>'
'A#<:<55>:<56>:>'
'i<j>'
'L<<M>N \n O<P>>Q'
------ with ^# everywhere -------
'A<:<11>:<12>:>'
'i<j>'
'L<<M>N \n O<P>>Q'
Upvotes: 1
Reputation: 4343
The None and empty string values are because you've used capturing brackets in your pattern, so the split is including the matched text - see the official documentation for mention of this.
If you amend your pattern to r"((?:\S*<.*?>\S*)|\S+)" (i.e. making the inner brackets non-capturing, wrapping the whole alternation in a capturing group, and changing the \s+ to \S+) it should work, but only by keeping the delimiters, which you then need to filter out by skipping alternate items. I think you're better off with this:
ITEM_RE = re.compile(r"(?:\S*<.*?>\S*)|\S+")
ITEM_RE.findall("spam bar ds<hai bye>sd baz eggs")
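For comparison, a rough sketch of the keep-the-delimiters variant with re.split (the [1::2] slice is just one way of skipping the alternate non-token items):
import re

parts = re.split(r"((?:\S*<.*?>\S*)|\S+)", "spam bar ds<hai bye>sd baz eggs")
print(parts)        # ['', 'spam', ' ', 'bar', ' ', 'ds<hai bye>sd', ' ', 'baz', ' ', 'eggs', '']
print(parts[1::2])  # ['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs']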
If you don't need an actual list (i.e. you only go through them one item at a time) then finditer() is more efficient as it only yields them one at a time. This is especially true if you're likely to bail out without going through the whole list.
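A minimal sketch of that usage (the early break is only there to illustrate bailing out part-way through):
import re

ITEM_RE = re.compile(r"(?:\S*<.*?>\S*)|\S+")
for m in ITEM_RE.finditer("spam bar ds<hai bye>sd baz eggs"):
    token = m.group(0)
    if token == 'baz':   # hypothetical early exit condition
        break
    print(token)         # prints spam, bar, ds<hai bye>sd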
It might also be possible in principle with a negative lookbehind assertion, but in practice I don't think it's possible to create one flexible enough - I tried r"(?<!<[^>]*)\s+" and got the error "look-behind requires fixed-width pattern", so I guess that's a no-no. The docs corroborate this - lookbehind assertions (both positive and negative) all need to be fixed width.
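For what it's worth, you can reproduce that error directly (just a quick check, not part of the solution):
import re

try:
    re.compile(r"(?<!<[^>]*)\s+")
except re.error as exc:
    print(exc)   # look-behind requires fixed-width pattern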
The issue with this approach is going to be if you expect nested angle brackets - then you're not going to get what you expect. For example, parsing ds<hai <bye> foo>sd will yield ds<hai <bye> as one token. I think this is the class of problem that regular expressions can't address - you need something closer to a proper parser. It wouldn't be hard to write one in pure Python which goes through the string a character at a time and counts nesting levels of brackets, but that'll be quite slow - see the sketch below. It depends on whether you can be sure you'll only see one level of nesting in your input.
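A rough sketch of such a character-by-character parser (a hypothetical tokenize() helper, not tuned for speed):
def tokenize(text):
    # Split on whitespace, but keep whitespace that sits inside (possibly
    # nested) <...> by tracking the current bracket depth.
    tokens, current, depth = [], [], 0
    for ch in text:
        if ch.isspace() and depth == 0:
            if current:
                tokens.append(''.join(current))
                current = []
            continue
        if ch == '<':
            depth += 1
        elif ch == '>' and depth > 0:
            depth -= 1
        current.append(ch)
    if current:
        tokens.append(''.join(current))
    return tokens

print(tokenize("spam bar ds<hai <bye> foo>sd baz eggs"))
# ['spam', 'bar', 'ds<hai <bye> foo>sd', 'baz', 'eggs']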
Upvotes: 3
Reputation: 152
I believe the None values are due to the presence of ()s in the pattern, based on this line from the documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list
Using the Regex Tester on your input may also help visualize the parsing: http://regexpal.com/?flags=g&regex=%28\S*%3C.*%3F%3E\S*%29|\s%2B&input=spam%20bar%20ds%3Chai%20bye%3Esd%20baz%20eggs
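A tiny made-up example of that documented behaviour (the group is None whenever the other alternative matched):
import re

print(re.split(r'(\d)|,', 'a,b1c'))
# ['a', None, 'b', '1', 'c']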
Upvotes: 0