Reputation: 144
I'm trying to get the actual HTML of about 600-700 pages hosted on a MediaWiki. I have thought of/tried the following:
Action API with action=parse: works well, takes about 0.75 seconds per page. I haven't been able to do this for multiple pages with "|" between their names. The only option is a for loop (Python example below):
import requests

# API_URL and USER_AGENT are module-level constants (the wiki's api.php URL and a descriptive User-Agent string)

def get_html(title):
    headers = {"User-Agent": USER_AGENT}
    pages = []
    if not isinstance(title, list):
        title = [title]
    for t in title:
        params = {
            "page": t,
            "action": "parse",
            "format": "json",
            "formatversion": "2",
        }
        # action=parse only accepts a single page per request
        req = requests.get(API_URL, headers=headers, params=params)
        pages.append(req.json()["parse"]["text"])
    return pages
Use the render action on the wiki's URL (not the API). This also takes only one page at a time and averages 0.75 seconds per page.
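A minimal sketch of that approach, assuming the wiki's index.php lives at WIKI_URL (a placeholder) and reusing the USER_AGENT constant from above; the function name is just illustrative:

import requests

WIKI_URL = "https://example.org/w/index.php"  # placeholder for the wiki's index.php

def get_html_render(titles):
    headers = {"User-Agent": USER_AGENT}
    pages = []
    for t in titles:
        # ?action=render returns just the rendered page body, without the skin/chrome
        req = requests.get(WIKI_URL, headers=headers,
                           params={"title": t, "action": "render"})
        pages.append(req.text)
    return pages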
The REST API with the /page/title/with_html or /page/title/html endpoints. These require RESTBase and VirtualRESTService on the wiki, but they're not there and I cannot install them (it's not my wiki). The REST API for this wiki also averages 0.75 seconds per request (for getting the wikitext), so it wouldn't be a better option anyway.
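For reference, a sketch of what such a call would look like if the HTML endpoint were available, assuming WIKI_REST_URL (a placeholder for the wiki's rest.php/v1 base) and the same USER_AGENT; the function name is illustrative only:

import requests

WIKI_REST_URL = "https://example.org/w/rest.php/v1"  # placeholder; needs RESTBase/Parsoid support on the wiki

def get_html_rest(title):
    # GET /page/{title}/html returns the latest revision rendered as HTML
    req = requests.get(f"{WIKI_REST_URL}/page/{title}/html",
                       headers={"User-Agent": USER_AGENT})
    return req.text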
I already have the wikitext (getting it for the 600-700 pages using action=query with 50 pages per request takes about 10 seconds; see the batched-query sketch after the Parsoid example below). I can pass it to Wikipedia's Parsoid endpoint and deal with the messed-up templates, links, etc. later. This still requires a for loop, unfortunately, but it's much quicker at about 0.185 seconds per page.
import requests

# USER_AGENT is the same module-level constant as above
PARSOID_URL = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html/"

def get_html_parsoid(text):
    headers = {"User-Agent": USER_AGENT}
    pages = []
    if not isinstance(text, list):
        text = [text]
    for t in text:
        # POST raw wikitext to Wikipedia's public Parsoid transform endpoint
        req = requests.post(PARSOID_URL, headers=headers, data={"wikitext": t})
        pages.append(req.text)
    return pages
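And a sketch of the batched wikitext retrieval referred to above, reusing the assumed API_URL and USER_AGENT constants (the function name is illustrative); action=query accepts up to 50 titles per request joined with "|":

import requests

def get_wikitext(titles):
    headers = {"User-Agent": USER_AGENT}
    texts = {}
    # 50 titles per request is the usual limit for non-bot accounts
    for i in range(0, len(titles), 50):
        params = {
            "action": "query",
            "format": "json",
            "formatversion": "2",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "titles": "|".join(titles[i:i + 50]),
        }
        req = requests.get(API_URL, headers=headers, params=params)
        for page in req.json()["query"]["pages"]:
            if "revisions" in page:  # missing/invalid pages have no revisions
                texts[page["title"]] = page["revisions"][0]["slots"]["main"]["content"]
    return texts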
Use rvparse with action=query and prop=revisions. Unfortunately, this is deprecated (with no replacement besides action=parse, as far as I know). Further, it only returns the HTML for the first page (the documentation says this option forcibly sets rvlimit=1).
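A minimal sketch of that deprecated request, again assuming API_URL and USER_AGENT as above; because rvlimit is forced to 1, only the first title comes back parsed:

import requests

params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "rvprop": "content",
    "rvparse": 1,               # deprecated; forces rvlimit=1
    "titles": "Title1|Title2",  # hypothetical titles; only the first is parsed
}
req = requests.get(API_URL, headers={"User-Agent": USER_AGENT}, params=params)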
Is a for loop really the best I can do with the MediaWiki API? The Parsoid solution works okay-ish (2 minutes to generate all the HTML) but is not ideal. Is there a way to send a GET request to parse or render with multiple pages, like you can with query (?titles=file1|file2|file3|...)?
I'm more than happy with a local solution to convert the wikitext I already have to HTML. Ideally, I don't want my package's users to have to install anything external outside of their pip/conda venv to get this to work, but if that's the best I can do...
Upvotes: 3
Views: 157