Ash

Reputation: 144

Bulk convert MediaWiki pages to HTML (either via the API or locally from saved wikitext)

I'm trying to get the actual HTML of about 600-700 pages hosted on a MediaWiki. I have thought of/tried the following:

Option 1

The Action API with action=parse: works well and takes about 0.75 seconds per page. I haven't been able to request multiple pages at once by joining their names with "|". The only option seems to be a for loop (Python example):

import requests

# API_URL is the wiki's api.php endpoint; USER_AGENT identifies the client.
def get_html(title):
    headers = {"User-Agent": USER_AGENT}

    pages = []
    if not isinstance(title, list):
        title = [title]
    for t in title:
        # action=parse accepts only a single page per request
        params = {
            "page": t,
            "action": "parse",
            "format": "json",
            "formatversion": "2",
        }
        req = requests.get(API_URL, headers=headers, params=params)
        pages.append(req.json()["parse"]["text"])

    return pages

Option 2

Use the render action on the wiki's URL (not the API). This also takes only one page at a time and averages 0.75 seconds per page.
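For reference, a minimal sketch of that request (assuming a hypothetical WIKI_INDEX_URL constant pointing at the wiki's index.php, and the same USER_AGENT as above):

import requests

# Sketch only: WIKI_INDEX_URL is assumed, e.g. "https://example.org/w/index.php".
# action=render returns the rendered page body as a bare HTML fragment.
def get_html_render(title):
    headers = {"User-Agent": USER_AGENT}
    params = {"title": title, "action": "render"}
    req = requests.get(WIKI_INDEX_URL, headers=headers, params=params)
    return req.text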

Option 3

The REST API with the /page/{title}/with_html or /page/{title}/html endpoints. These require RESTBase and VirtualRESTService on the wiki, but they're not installed and I cannot install them (it's not my wiki). The REST API for this wiki also averages 0.75 seconds per request (for getting the wikitext), so it wouldn't be a faster option anyway.
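For completeness, on a wiki that does expose the core REST API (English Wikipedia, for instance), the request would look roughly like this sketch:

import requests

# Sketch only: assumes the target wiki serves the core REST API at /w/rest.php/v1/.
# /page/{title}/html returns the rendered HTML directly; /page/{title}/with_html
# wraps it in JSON together with page metadata.
REST_HTML_URL = "https://en.wikipedia.org/w/rest.php/v1/page/{title}/html"

def get_html_rest(title):
    headers = {"User-Agent": USER_AGENT}
    req = requests.get(REST_HTML_URL.format(title=title), headers=headers)
    return req.text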

Option 4

I already have the wikitext (fetching it for all 600-700 pages using action=query with 50 pages per request takes about 10 seconds; see the sketch after the function below). I can pass it to Wikipedia's Parsoid endpoint and deal with the messed-up templates, links, etc. later. This still requires a for loop, unfortunately, but it's much quicker at about 0.185 seconds per page.

import requests

# Wikipedia's public Parsoid transform endpoint (wikitext -> HTML).
PARSOID_URL = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html/"

def get_html_parsoid(text):
    headers = {"User-Agent": USER_AGENT}
    pages = []

    if not isinstance(text, list):
        text = [text]
    for t in text:
        data = {"wikitext": t}
        req = requests.post(PARSOID_URL, headers=headers, data=data)
        pages.append(req.text)

    return pages
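The batched wikitext fetch mentioned above looks roughly like this (a sketch, assuming the same API_URL and USER_AGENT; titles are joined with "|", 50 per request):

import requests

def get_wikitext(titles):
    headers = {"User-Agent": USER_AGENT}
    pages = {}
    # action=query accepts up to 50 titles per request for regular accounts.
    for i in range(0, len(titles), 50):
        params = {
            "action": "query",
            "format": "json",
            "formatversion": "2",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "titles": "|".join(titles[i:i + 50]),
        }
        req = requests.get(API_URL, headers=headers, params=params)
        for page in req.json()["query"]["pages"]:
            if "revisions" in page:
                pages[page["title"]] = page["revisions"][0]["slots"]["main"]["content"]
    return pages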

Option 5

Use rvparse with action=query and prop=revisions. Unfortunately, this is deprecated (with no replacement besides action=parse, as far as I know). Further, it only returns the HTML for the first page (the documentation says this option forcibly sets rvlimit=1).
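For reference, the deprecated request looks roughly like this (again a sketch, assuming API_URL and USER_AGENT as before); even if several titles are passed, only the first comes back parsed:

import requests

def get_html_rvparse(titles):
    headers = {"User-Agent": USER_AGENT}
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "rvprop": "content",
        "rvparse": 1,                # deprecated; forcibly sets rvlimit=1
        "titles": "|".join(titles),
    }
    req = requests.get(API_URL, headers=headers, params=params)
    return req.json()["query"]["pages"]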

Other Options?

Is a for loop really the best I can do with the MediaWiki API? The Parsoid solution works okay-ish (2 minutes to generate all the HTML) but is not ideal. Is there a way to send a GET request to parse or render with multiple pages, the way you can with query (?titles=file1|file2|file3|...)?

I'm more than happy with a local solution to convert the wikitext I already have to HTML. Ideally, I don't want my package's users to have to install anything external outside of their pip/conda environment to get this working, but if that's the best I can do...

Upvotes: 3

Views: 157

Answers (0)
