Reputation: 61
I'm writing a Python script to parse Wikipedia articles, and part of that process is parsing links. I'm trying to write a regular expression that matches in this way:
[[:Category:Anarchism by country|Anarchism by country]]
-> :Category:Anarchism by country
[[Squatting|squat]]
-> Squatting
[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right)
-> John Zerzan
* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich |title=[[Anarchist Voices: An Oral History of Anarchism in America]] |year=1996 |publisher=[[Princeton University Press]] |isbn=978-0-691-04494-1
-> Unmatched, begins with * {{ (citation)
I've reached \[\[([^|\]]+)(?:\|[^|\]]+)?\]\] which works in 3 of the above examples, but in the citation it matches the title and the publisher. I know (I think) I need a negative lookahead to prevent any matches in the last example. I'm very bad with regex, however, so any suggestions would be greatly appreciated.
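For reference, the behaviour described above can be reproduced with a short standard-library script. This is only a sketch: the names LINK_RE, samples and results are made up for illustration, and the strings are the four example lines from the question.

```python
import re

# The pattern from the question, unchanged.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|[^|\]]+)?\]\]")

# Hypothetical container holding the four example lines from the question.
samples = {
    "category": "[[:Category:Anarchism by country|Anarchism by country]]",
    "piped": "[[Squatting|squat]]",
    "file": "[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) "
            "and [[John Zerzan]] (right)",
    "citation": "* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich "
                "|title=[[Anarchist Voices: An Oral History of Anarchism in America]] "
                "|year=1996 |publisher=[[Princeton University Press]] "
                "|isbn=978-0-691-04494-1",
}

# With a single capture group, findall returns the captured link targets.
results = {name: LINK_RE.findall(line) for name, line in samples.items()}
for name, found in results.items():
    print(name, found)
```

The first three samples yield exactly the wanted target, while the citation sample yields both the title and the publisher, which is the unwanted match.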
Upvotes: 0
Views: 54
Reputation: 10437
Wikitext is quite complicated and should not be parsed with regexes alone. Instead, use a full-fledged parser, such as mwparserfromhell:
import mwparserfromhell as mph

def get_links_outside_of_templates(text):
    tree = mph.parse(text)
    # Lazily iterate over top-level links only;
    # links nested inside templates are skipped.
    links = tree.ifilter_wikilinks(recursive=False)
    for link in links:
        if link.title.startswith('File'):
            # If this is a File link, recursively parse its "text".
            yield from get_links_outside_of_templates(link.text)
        else:
            yield link.title

# `text` is the wikitext to parse, such as the sample below.
print([*get_links_outside_of_templates(text)])
For the following wikitext (partly generated by ChatGPT):
'''Squatting''' may refer to [[Squatting|squat]], the act of occupying an abandoned or unused property without legal permission.
== Foo ==
[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right)]]
Lorem ipsum dolor sit [[amet]], consectetur adipiscing elit. Vestibulum interdum, neque nec aliquet venenatis, tortor erat commodo nulla, id imperdiet mi urna eget nunc.
== References ==
* {{cite book
|last=Avrich |first=Paul |author-link=Paul Avrich
|title=[[Anarchist Voices: An Oral History of Anarchism in America]]
|year=1996 |publisher=[[Princeton University Press]]
|isbn=978-0-691-04494-1
}}
[[:Category:Anarchism by country|Anarchism by country]]
...it outputs:
['Squatting', 'John Zerzan', 'amet', ':Category:Anarchism by country']
Unfortunately, mwparserfromhell doesn't recognize namespaces, so you will have to check for File links on your own if you use it. I use a crude .startswith('File') in the function above, but you might want a stricter check, since namespace names are case-insensitive: file and fIlE are both valid and mean the same as File.
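A case-insensitive namespace check could look something like the sketch below. The names is_file_link and FILE_NAMESPACE_NAMES are hypothetical helpers, not part of mwparserfromhell. The sketch assumes an English-language wiki, where "Image" is the legacy alias of "File"; other wikis have localized namespace names that would need to be added. It takes the title as a string, so a parsed title would be passed as str(link.title).

```python
# Assumption: English namespace names only; "Image" is the legacy alias of "File".
FILE_NAMESPACE_NAMES = {"file", "image"}

def is_file_link(title):
    """Return True if a link title appears to be in the File namespace."""
    # Everything before the first colon is the candidate namespace.
    namespace, sep, _ = str(title).partition(":")
    # Require a colon so that ordinary titles such as "Filesystem" don't match.
    return bool(sep) and namespace.strip().casefold() in FILE_NAMESPACE_NAMES

for title in ["File:Jarach and Zerzan.JPG", "fIlE:Example.png",
              "Filesystem", ":Category:Anarchism by country"]:
    print(title, is_file_link(title))
```

Unlike .startswith('File'), this does not treat an article titled "Filesystem" as a file, and a title with a leading colon (a plain link to a page rather than an embedded file) is not matched either.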
Upvotes: 1