Reputation: 13778
I want to extract some information from my data.
the most full row maybe like below(each parts may contain CJK character):
0. (event) (tag) [group (artist)] title (form) [addition1] [addition2]
one row also may be:
1. (event) [group (artist)] title (form) [addition1]
2. [event] [group (artist)] title (form) (addition1)
3. (tag) [group (artist)] title
4. [group (artist)] title
5. title
6. and something like above, such as 【tag】 [group (artist)] title 【form】
As we see, the most simple row is just plain text title
,
I write a regex try to match all of them
import re
regex_patern = ur'([\(\[](?P<event>[^\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\)\[\]]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?(\s*[\(\[](?P<more1>[^\)\]]*)[\)\]])'
p = re.compile(regex_patern)
rows= [
'(event) (tag) [group (artist)] title (form) [addition1] [addition2]',
'(event) [group (artist)] title (form) [addition1]',
'[event] [group (artist)] title (form) (addition1)',
'(tag) [group (artist)] title',
'[group (artist)] title',
'title',
]
for r in rows:
r = re.search(p, r)
print r.groupdict()
output:
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': 'tag', u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'tag'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': None}
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-831c548bc3f0> in <module>()
15 for r in rows:
16 r = re.search(p, r)
---> 17 print r.groupdict()
AttributeError: 'NoneType' object has no attribute 'groupdict'
The result become unexpected from row 4.
I think re
should search from middle. first look for [group (artist)] and title
, but I don't know how to write in regex.
Or I am doing the wrong way?
Upvotes: 1
Views: 145
Reputation: 5395
EDIT
It seem (at least on sample you provide) you can correctly match and group whole string with:
^(?:(?:^[\[()](?P<event>[^)\]]+)[)\]](?=.+[\])]$)\s)?(?:[(【](?P<tag>(?<=^[(【])[^】)]+(?=.+[\w】]$)|(?<=\)\s\()[^)]+(?=\)\s\[))[】)]\s)?\[(?:(?P<group>[^(\]]+)\s+\((?P<artist>[^)]+)\)\])\s+)?(?P<title>[^(\n)【]+)(?:\s*[\(【](?P<form>[^)】]+)[)】](?:\s*[\[(](?P<add>[^\])]+)[\])])?(?:\s*[\[(](?P<add2>[^\])]+)[\])])?)?$
used in:
import re
rows= [
'(event) (tag) [group (artist)] title (form) [addition1] [addition2]',
'(event) [group (artist)] title (form) [addition1]',
'[event] [group (artist)] title (form) (addition1)',
'(tag) [group (artist)] title',
'[group (artist)] title',
'title',
]
p = re.compile(ur'^(?:(?:^[\[()](?P<event>[^)\]]+)[)\]](?=.+[\])]$)\s)?(?:[(【](?P<tag>(?<=^[(【])[^】)]+(?=.+[\w】]$)|(?<=\)\s\()[^)]+(?=\)\s\[))[】)]\s)?\[(?:(?P<group>[^(\]]+)\s+\((?P<artist>[^)]+)\)\])\s+)?(?P<title>[^(\n)【]+)(?:\s*[\(【](?P<form>[^)】]+)[)】](?:\s*[\[(](?P<add>[^\])]+)[\])])?(?:\s*[\[(](?P<add2>[^\])]+)[\])])?)?$')
for r in rows:
[m.groupdict() for m in p.finditer(r)]
print m.groupdict()
gives output:
{u'event': 'event', u'tag': 'tag', u'group': 'group', u'artist': 'artist', u'title': 'title ', u'form': 'form', u'add': 'addition1', u'add2': 'addition2'}
{u'event': 'event', u'tag': None, u'group': 'group', u'artist': 'artist', u'title': 'title ', u'form': 'form', u'add': 'addition1', u'add2': None}
{u'event': 'event', u'tag': None, u'group': 'group', u'artist': 'artist', u'title': 'title ', u'form': 'form', u'add': 'addition1', u'add2': None}
{u'event': None, u'tag': 'tag', u'group': 'group', u'artist': 'artist', u'title': 'title', u'form': None, u'add': None, u'add2': None}
{u'event': None, u'tag': None, u'group': 'group', u'artist': 'artist', u'title': 'title', u'form': None, u'add': None, u'add2': None}
{u'event': None, u'tag': None, u'group': None, u'artist': None, u'title': 'title', u'form': None, u'add': None, u'add2': None}
This regex is composed of couple parts:
(?:^[\[()](?P<event>[^)\]]+)[)\]](?=.+[\])]$)\s)?
- matching events(?:[(【](?P<tag>(?<=^[(【])[^】)]+(?=.+[\w】]$)|(?<=\)\s\()[^)]+(?=\)\s\[))[】)]\s)?
- matching tags\[(?:(?P<group>[^(\]]+)\s+\((?P<artist>[^)]+)\)\])\s+)?
- matching groups(?P<title>[^(\n)【]+)
- matching title(?:\s*[\(【](?P<form>[^)】]+)[)】](?:\s*[\[(](?P<add>[^\])]+)[\])])?(?:\s*[\[(](?P<add2>[^\])]+)[\])])?)?
- matching form and addsAs you can see, every part, excluding part matching a title
, ends with ?
quantifier, which means zero or one. Because of that, these part are optional, it will match if there is fragment to match, but if not, it will not disturb (at least it should not) how rest of regex will work. This is why it seems like it match "from a middle", not "from left to right".
Upvotes: 1