Regex to split a variable-format string into a groupdict

Question

I want to extract some information from my data.

The most complete row may look like the one below (each part may contain CJK characters):

0. (event) (tag) [group (artist)] title (form) [addition1] [addition2]

A row may also be:

1. (event) [group (artist)] title (form) [addition1]

2. [event] [group (artist)] title (form) (addition1)

3. (tag) [group (artist)] title

4. [group (artist)] title

5. title

6. And variants of the above, such as 【tag】 [group (artist)] title 【form】

As you can see, the simplest row is just a plain-text title. I wrote a regex that tries to match all of them:

<pre><code>import re
regex_patern = ur'([\(\[](?P<event>[^\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\[]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?(\s*[\(\[](?P<more1>[^\)\]]*)[\)\]])'

p = re.compile(regex_patern)

rows= [
'(event) (tag) [group (artist)] title (form) [addition1] [addition2]',
'(event) [group (artist)] title (form) [addition1]',
'[event] [group (artist)] title (form) (addition1)',
'(tag) [group (artist)] title',
'[group (artist)] title',
'title',
]

for r in rows:
    r = re.search(p, r)
    print r.groupdict()
</code></pre>

<p>output:</p>

<pre><code>{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': 'tag', u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'tag'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': None}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-831c548bc3f0> in <module>()
     15 for r in rows:
     16     r = re.search(p, r)
---> 17     print r.groupdict()

AttributeError: 'NoneType' object has no attribute 'groupdict'
</code></pre>

<p>The results become wrong starting from row 4. <br>
I think <code>re</code> should search from the middle: first look for <code>[group (artist)]</code> and <code>title</code>, but I don't know how to express that in a regex.
Or am I going about this the wrong way?</p>

m.cekiera · Accepted Answer

EDIT

It seems that (at least on the samples you provided) you can correctly match and group the whole string with:

<pre><code>^(?:(?:^[\[()](?P<event>[^)\]]+)[)\]](?=.+[\])]$)\s)?(?:[(【](?P<tag>(?<=^[(【])[^】)]+(?=.+[\w】]$)|(?<=\)\s\()[^)]+(?=\)\s\[))[】)]\s)?\[(?:(?P<group>[^(\]]+)\s+\((?P<artist>[^)]+)\)\])\s+)?(?P<title>[^(\n)【]+)(?:\s*[\(【](?P<form>[^)】]+)[)】](?:\s*[\[(](?P<add>[^\])]+)[\])])?(?:\s*[\[(](?P<add2>[^\])]+)[\])])?)?$
</code></pre>

<p><a href="https://regex101.com/r/nJ0lB9/3" rel="nofollow">DEMO</a></p>

<p>used in:</p>

<pre><code>import re

rows= [
'(event) (tag) [group (artist)] title (form) [addition1] [addition2]',
'(event) [group (artist)] title (form) [addition1]',
'[event] [group (artist)] title (form) (addition1)',
'(tag) [group (artist)] title',
'[group (artist)] title',
'title',
]

p = re.compile(ur'^(?:(?:^[\[()](?P<event>[^)\]]+)[)\]](?=.+[\])]$)\s)?(?:[(【](?P<tag>(?<=^[(【])[^】)]+(?=.+[\w】]$)|(?<=\)\s\()[^)]+(?=\)\s\[))[】)]\s)?\[(?:(?P<group>[^(\]]+)\s+\((?P<artist>[^)]+)\)\])\s+)?(?P<title>[^(\n)【]+)(?:\s*[\(【](?P<form>[^)】]+)[)】](?:\s*[\[(](?P<add>[^\])]+)[\])])?(?:\s*[\[(](?P<add2>[^\])]+)[\])])?)?$')

for r in rows:
    # The pattern is anchored with ^ and $, so each row yields at most one match
    for m in p.finditer(r):
        print m.groupdict()
</code></pre>

<p>gives output:</p>

<pre><code>{u'event': 'event', u'tag': 'tag', u'group': 'group', u'artist': 'artist', u'title': 'title ', u'form': 'form', u'add': 'addition1', u'add2': 'addition2'} 
{u'event': 'event', u'tag': None, u'group': 'group', u'artist': 'artist', u'title': 'title ', u'form': 'form', u'add': 'addition1', u'add2': None} 
{u'event': 'event', u'tag': None, u'group': 'group', u'artist': 'artist', u'title': 'title ', u'form': 'form', u'add': 'addition1', u'add2': None} 
{u'event': None, u'tag': 'tag', u'group': 'group', u'artist': 'artist', u'title': 'title', u'form': None, u'add': None, u'add2': None} 
{u'event': None, u'tag': None, u'group': 'group', u'artist': 'artist', u'title': 'title', u'form': None, u'add': None, u'add2': None} 
{u'event': None, u'tag': None, u'group': None, u'artist': None, u'title': 'title', u'form': None, u'add': None, u'add2': None}
</code></pre>

<p><a href="https://ideone.com/qJ9bjm" rel="nofollow">DEMO</a></p>

<p>This regex is composed of several parts:</p>

<ul>
<li><code>(?:^[\[()](?P<event>[^)\]]+)[)\]](?=.+[\])]$)\s)?</code> - matching events</li>
<li><code>(?:[(【](?P<tag>(?<=^[(【])[^】)]+(?=.+[\w】]$)|(?<=\)\s\()[^)]+(?=\)\s\[))[】)]\s)?</code> - matching tags</li>
<li><code>\[(?:(?P<group>[^(\]]+)\s+\((?P<artist>[^)]+)\)\])\s+)?</code> - matching groups and artists (the final <code>)?</code> closes the outer <code>(?:</code> that follows the leading <code>^</code>)</li>
<li><code>(?P<title>[^(\n)【]+)</code> - matching title</li>
<li><code>(?:\s*[\(【](?P<form>[^)】]+)[)】](?:\s*[\[(](?P<add>[^\])]+)[\])])?(?:\s*[\[(](?P<add2>[^\])]+)[\])])?)?</code> - matching form and adds</li>
</ul>
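<p>A minimal sketch of how the parts listed above fit together, written for Python 3 (an assumption; the snippet in this answer targets Python 2). The constant names and the <code>parse_row</code> helper are hypothetical, introduced only for illustration. The fragments are concatenated in the order of the list, and the helper strips the trailing space that the title part can capture.</p>

```python
import re

# Fragments copied from the list above; the names are illustrative only.
OUTER_OPEN = r'^(?:'  # outer optional group: event/tag only occur with a group
EVENT = r'(?:^[\[()](?P<event>[^)\]]+)[)\]](?=.+[\])]$)\s)?'
TAG = (r'(?:[(【](?P<tag>(?<=^[(【])[^】)]+(?=.+[\w】]$)'
       r'|(?<=\)\s\()[^)]+(?=\)\s\[))[】)]\s)?')
# The trailing ")?" of GROUP closes OUTER_OPEN, so this fragment
# is not a valid pattern on its own.
GROUP = r'\[(?:(?P<group>[^(\]]+)\s+\((?P<artist>[^)]+)\)\])\s+)?'
TITLE = r'(?P<title>[^(\n)【]+)'
TAIL = (r'(?:\s*[\(【](?P<form>[^)】]+)[)】]'
        r'(?:\s*[\[(](?P<add>[^\])]+)[\])])?'
        r'(?:\s*[\[(](?P<add2>[^\])]+)[\])])?)?')

PATTERN = re.compile(OUTER_OPEN + EVENT + TAG + GROUP + TITLE + TAIL + '$')


def parse_row(row):
    """Return a dict of named parts for one row, or None if it does not match."""
    m = PATTERN.match(row)
    if m is None:
        return None
    # Parts that are absent from the row come back as None.
    return {k: (v.strip() if v is not None else None)
            for k, v in m.groupdict().items()}


if __name__ == '__main__':
    for row in ['(tag) [group (artist)] title', 'title']:
        print(parse_row(row))
```

<p>Returning <code>None</code> for a non-matching row, instead of calling <code>groupdict()</code> unconditionally, avoids the <code>AttributeError</code> shown in the question.</p>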

<p>As you can see, every part except the one matching the <code>title</code> ends with the <code>?</code> quantifier, which means zero or one occurrence. Those parts are therefore optional: each one matches if the corresponding fragment is present, and if it is absent it does not (or at least should not) disturb how the rest of the regex works. This is why the regex seems to match "from the middle" rather than "from left to right".</p>
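<p>The effect of the <code>?</code> quantifier can be seen on a deliberately reduced pattern. This is only a sketch in Python 3: <code>SMALL</code> and <code>small_parse</code> are hypothetical names, and the pattern is a simplification for illustration, not the regex from this answer.</p>

```python
import re

# One optional bracketed part followed by a mandatory title.
SMALL = re.compile(r'^(?:\[(?P<group>[^\]]+)\]\s+)?(?P<title>.+)$')


def small_parse(row):
    """Return the named groups of SMALL for a row, or None if there is no match."""
    m = SMALL.match(row)
    return m.groupdict() if m else None


if __name__ == '__main__':
    # The optional part is present in the first row and absent in the second;
    # an absent named group is reported as None by groupdict().
    print(small_parse('[group] title'))
    print(small_parse('title'))
```

<p>Because the bracketed part is wrapped in <code>(?:...)?</code>, a row without it still matches, and the mandatory <code>title</code> group is what anchors the match.</p>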
