jemhop

Reputation: 61

Regex which excludes groups if they are preceded by curly brackets and only matches text within the first section of the bracket

I'm writing a Python script to parse Wikipedia articles, and part of that process is parsing links. I'm trying to write a regular expression that matches in this way:

I've reached \[\[([^|\]]+)(?:\|[^|\]]+)?\]\] which works in 3 of the above examples, but in the citation it matches the title and the publisher. I know (I think) I need a negative lookahead to prevent any matches in the last example. I'm very bad with regex however, so any suggestions would be greatly appreciated.
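A minimal reproduction of the problem, assuming an illustrative sample string (the `sample` text and variable names are made up for this sketch, not taken from a real article):

```python
import re

# The pattern from above: capture the link target, ignore the optional "|label" part.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|[^|\]]+)?\]\]")

# Hypothetical sample: one citation template followed by one ordinary link.
sample = (
    "{{cite book |title=[[Anarchist Voices]] "
    "|publisher=[[Princeton University Press]]}} "
    "may refer to [[Squatting|squat]]"
)

matches = LINK_RE.findall(sample)
# The two links inside {{...}} are matched too, which is the unwanted behaviour.
print(matches)
```

Only 'Squatting' should be returned here; the two links inside the template should be skipped.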

Upvotes: 0

Views: 54

Answers (1)

InSync

Reputation: 10437

Wikitext is quite complicated and should not be parsed with regexes alone. Instead, use a full-fledged parser, such as mwparserfromhell:

import mwparserfromhell as mph

def get_links_outside_of_templates(text):
  tree = mph.parse(text)
  # Lazily iterate over top-level wikilinks only;
  # recursive=False skips links nested inside templates.
  links = tree.ifilter_wikilinks(recursive=False)

  for link in links:
    if link.title.startswith('File'):
      # If this is a File link, recursively parse its "text" (the caption).
      yield from get_links_outside_of_templates(link.text)
    else:
      yield link.title

# `text` is assumed to hold the wikitext to parse, as a string.
print([*get_links_outside_of_templates(text)])

For the following wikitext (partly generated by ChatGPT):

'''Squatting''' may refer to [[Squatting|squat]], the act of occupying an abandoned or unused property without legal permission.

== Foo ==

[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right)]]

Lorem ipsum dolor sit [[amet]], consectetur adipiscing elit. Vestibulum interdum, neque nec aliquet venenatis, tortor erat commodo nulla, id imperdiet mi urna eget nunc.

== References ==
* {{cite book
  |last=Avrich |first=Paul |author-link=Paul Avrich
  |title=[[Anarchist Voices: An Oral History of Anarchism in America]]
  |year=1996 |publisher=[[Princeton University Press]]
  |isbn=978-0-691-04494-1
  }}

[[:Category:Anarchism by country|Anarchism by country]]

...it outputs:

['Squatting', 'John Zerzan', 'amet', ':Category:Anarchism by country']

Unfortunately, mwparserfromhell doesn't recognize namespaces, so you will have to check for File links on your own if you use it. I use a crude .startswith('File') in the function above, but you might want a better check, since namespace names are case-insensitive: file and fIlE are both valid and mean the same as File.
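One possible stricter check is sketched below. It is not part of mwparserfromhell: is_file_title and FILE_NAMESPACES are hypothetical names, and the set of namespace names assumes a default English-language wiki, where Image is a legacy alias of File. Localized wikis use other names, so adjust the set as needed.

```python
# Assumption: default English namespace names; "image" is the legacy alias of "file".
FILE_NAMESPACES = {"file", "image"}

def is_file_title(title):
    """Return True if a wikilink title carries a File namespace prefix."""
    # str() lets this accept both plain strings and parser objects.
    namespace, colon, _ = str(title).partition(":")
    if not colon:
        # No colon at all, so there is no namespace prefix ("Filesystem").
        return False
    # A leading colon (":File:X") yields an empty prefix, so such plain
    # links to the file page are not treated as embedded files.
    return namespace.strip().casefold() in FILE_NAMESPACES

print(is_file_title("fIlE:Jarach and Zerzan.JPG"))
print(is_file_title("Filesystem"))
```

In the function above, this would replace the condition link.title.startswith('File') with is_file_title(link.title).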

Upvotes: 1
