Alexander Stavonin

Reputation: 670

Wikipedia link parsing with regex

I'm trying to extract the list of languages from the Wikipedia article List_of_programming_languages_by_type. Here are a few of its lines:

  • [[Ada (programming language)|Ada]] (multi-purpose language)
  • [[Afnix (programming language)|Afnix]] – concurrent access to data is protected automatically (previously called ''Aleph'', but unrelated to ''Alef'')
  • [[Cilk]] – a concurrent [[C (programming language)|C]]

Almost all lines are parsed correctly, except lines with multiple [[ ]] blocks (the Cilk line in the example). Parsing code:

import re

# lines: the article's wikitext, split into lines
for line in lines:
    lang = re.search(r'^\*+\s*(\[\['
                     r'((?P<wiki_link>.+?)(\|))?'
                     r'(?P<lang_name>.+?)'
                     r'\]\])', line)
    if lang:
        print(lang.groupdict())

And output:

{'wiki_link': u'Ada (programming language)', 'lang_name': u'Ada'}
{'wiki_link': u'Afnix (programming language)', 'lang_name': u'Afnix'}
{'wiki_link': u'Cilk]] &ndash; a concurrent [[C (programming language)', 'lang_name': u'C'}

How can I handle multiple [[ ]] blocks in one line?

P.S. expected results:

{'wiki_link': None, 'lang_name': u'Cilk'}

Upvotes: 1

Views: 769

Answers (1)

Herrington Darkholme

Reputation: 6315

You're almost there:

lang = re.search(r'^\*+\s*(\[\['
                 r'((?P<wiki_link>[^]]+?)(\|))?'
                 r'(?P<lang_name>.+?)'
                 r'\]\])', line)

Just change (?P<wiki_link>.+?) to (?P<wiki_link>[^]]+?).

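The character class [^]] cannot match a closing bracket, so wiki_link can no longer run past the first ]] and into a later link on the same line. Note that this pattern still does not handle nested [[ ]] structures.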

>>> print lang.groupdict()
{'wiki_link': None, 'lang_name': 'Cilk'}
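
For reference, here is a minimal self-contained sketch of the fix applied to the three sample lines from the question. It assumes the raw wikitext uses * bullets and &ndash; entities; the name LINK_RE, the non-capturing group around the optional link part, and the Python 3 print() call are my own choices rather than part of the original code.

```python
import re

# Corrected pattern: wiki_link may not contain ']', so it cannot
# run past the first closing ']]' into a later link on the line.
LINK_RE = re.compile(
    r'^\*+\s*\[\['
    r'(?:(?P<wiki_link>[^]]+?)\|)?'   # optional "target|" part
    r'(?P<lang_name>.+?)'             # displayed name
    r'\]\]'
)

# Sample lines in raw wikitext form (assumed: bullets are '*').
lines = [
    "* [[Ada (programming language)|Ada]] (multi-purpose language)",
    "* [[Afnix (programming language)|Afnix]] &ndash; concurrent access "
    "to data is protected automatically",
    "* [[Cilk]] &ndash; a concurrent [[C (programming language)|C]]",
]

results = []
for line in lines:
    match = LINK_RE.search(line)
    if match:
        results.append(match.groupdict())

for result in results:
    print(result)
```

With this pattern the Cilk line yields wiki_link None and lang_name 'Cilk', and the Ada and Afnix lines are parsed as before. Only the first link on each line is captured, which is what the question asks for.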

Upvotes: 1
