Reputation: 670
I'm trying to extract the list of languages from the Wikipedia article List_of_programming_languages_by_type. Here are a few sample lines:
- [[Ada (programming language)|Ada]] (multi-purpose language)
- [[Afnix (programming language)|Afnix]] – concurrent access to data is protected automatically (previously called ''Aleph'', but unrelated to ''Alef'')
- [[Cilk]] – a concurrent [[C (programming language)|C]]
Almost all lines are parsed correctly, except lines with multiple [[ ]] blocks (the Cilk line in the example above). The parsing code:
import re

for line in lines:
    # Match the first [[link|name]] or [[name]] after the leading bullet.
    lang = re.search(r'^\*+\s*(\[\['
                     r'((?P<wiki_link>.+?)(\|))?'
                     r'(?P<lang_name>.+?)'
                     r'\]\])', line)
    if lang:
        print(lang.groupdict())
And the output:
{'wiki_link': u'Ada (programming language)', 'lang_name': u'Ada'}
{'wiki_link': u'Afnix (programming language)', 'lang_name': u'Afnix'}
{'wiki_link': u'Cilk]] – a concurrent [[C (programming language)', 'lang_name': u'C'}
How can I handle multiple [[ ]] blocks in one line?
P.S. Expected result for the Cilk line:
{'wiki_link': None, 'lang_name': u'Cilk'}
Upvotes: 1
Views: 769
Reputation: 6315
You're almost there:
lang = re.search(r'^\*+\s*(\[\['
                 r'((?P<wiki_link>[^]]+?)(\|))?'
                 r'(?P<lang_name>.+?)'
                 r'\]\])', line)
Just change (?P<wiki_link>.+?) to (?P<wiki_link>[^]]+?).
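The character class [^]] matches any character except ], so the wiki_link group can no longer run past the first closing ]] into a later [[ ]] block on the same line.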
>>> print lang.groupdict()
{'wiki_link': None, 'lang_name': 'Cilk'}
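To check the change end to end, here is a minimal self-contained sketch. It assumes the raw wikitext lines start with `*` (as the pattern requires), and the names `LINK_RE` and `first_link` are made up for this illustration, not part of the original code.

```python
import re

# Pattern from the answer: [^]] stops wiki_link from crossing a closing bracket.
LINK_RE = re.compile(
    r'^\*+\s*(\[\['
    r'((?P<wiki_link>[^]]+?)(\|))?'
    r'(?P<lang_name>.+?)'
    r'\]\])'
)

# Sample lines from the question, written with the leading "*" (an assumption
# about the raw wikitext, since the pattern anchors on it).
lines = [
    "* [[Ada (programming language)|Ada]] (multi-purpose language)",
    "* [[Cilk]] – a concurrent [[C (programming language)|C]]",
]

def first_link(line):
    """Return the named groups for the first [[ ]] block, or None if no match."""
    m = LINK_RE.search(line)
    return m.groupdict() if m else None

for line in lines:
    print(first_link(line))
```

On the Cilk line, the optional wiki_link part fails to find a `|` before the first `]`, so it is skipped and lang_name picks up `Cilk` on its own.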
Upvotes: 1