Reputation: 1541
I'm quite new to MediaWiki, and now I have a bit of a problem. I have the title of a wiki page, and I want to get just the text of that page using api.php, but all I have found in the API is a way to obtain the wiki content of the page (with wiki markup). I used this HTTP request...
/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test
But I need only the textual content, without the Wiki markup. Is that possible with the MediaWiki API?
Upvotes: 71
Views: 68913
Reputation: 553
Use action=render to get the cleanest possible page:
https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I?action=render
vs
https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I
Upvotes: 4
Reputation: 136187
Python users coming to this question might be interested in the wikipedia
module (docs):
import wikipedia
wikipedia.set_lang('de')
page = wikipedia.page('Wikipedia')
print(page.content)
All formatting, except for section headings (==), is stripped away.
Upvotes: 7
Reputation: 11
Once the contents are brought into your page, you can use the PHP function strip_tags() to remove the HTML tags.
Upvotes: -2
Reputation: 8845
The TextExtracts extension of the API does about what you're asking. Use prop=extracts
to get a cleaned up response. For example, this link will give you cleaned up text for the Stack Overflow article. What's also nice is that it still includes section tags, so you can identify individual sections of the article.
Just to include a visible link in my answer, the above link looks like:
/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true
Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.
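In Python, the TextExtracts call above can be sketched with the standard library. This only builds the query URL (no request is made here); the en.wikipedia.org endpoint and parameter names follow the example link in this answer:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # example endpoint; any wiki with TextExtracts works

def extract_url(title):
    """Build a TextExtracts query URL for one page title."""
    params = {
        "format": "xml",
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "redirects": "true",
    }
    return API + "?" + urlencode(params)

print(extract_url("Stack Overflow"))
```

You would then fetch that URL and read the extract out of the XML (or switch format to json).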
Upvotes: 47
Reputation: 1649
Adding ?action=raw
at the end of a MediaWiki page URL returns the latest content as raw wikitext. E.g.: https://en.wikipedia.org/wiki/Main_Page?action=raw
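A minimal Python sketch of building such a URL (the base URL is just an example, and no request is made here):

```python
from urllib.parse import quote

def raw_url(base, title):
    """Return the URL that serves the page's latest wikitext via ?action=raw."""
    return base + "/" + quote(title) + "?action=raw"

print(raw_url("https://en.wikipedia.org/wiki", "Main Page"))
```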
Upvotes: 40
Reputation: 2749
You can get the wiki data in text format from the API by using the explaintext
parameter. Plus, if you need to access many titles' information, you can get all the titles' wiki data in a single call. Use the pipe character |
to separate each title. For example, this API call will return the data from both the "Google" and "Yahoo" pages:
http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=
Parameters:
explaintext: Return extracts as plain text instead of limited HTML.
exlimit=max: Return more than one result. The max is currently 20.
exintro: Return only the content before the first section. If you want the full data, just remove this.
redirects=: Resolve redirect issues.
Upvotes: 33
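The multi-title call above can be sketched in Python. This builds the URL only (no request is made); the endpoint and parameters mirror the answer, and urlencode percent-encodes the pipe separator, which the API accepts:

```python
from urllib.parse import urlencode

def multi_extract_url(titles):
    """Build one extracts query for several page titles, pipe-separated."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",      # allow more than one extract in the response
        "explaintext": "",     # plain text instead of limited HTML
        "exintro": "",         # only the lead section; drop for full text
        "titles": "|".join(titles),
        "redirects": "",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

print(multi_extract_url(["Yahoo", "Google"]))
```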
Reputation: 358
That's the simplest way: http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content
Upvotes: 11
Reputation: 91467
Use action=parse
to get the html:
/api.php?action=parse&page=test
One way to get the text from the html would be to load it into a browser and walk the nodes, looking only for the text nodes, using JavaScript.
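If you would rather do this server-side than in a browser, Python's standard-library HTMLParser can walk the markup and keep only the text nodes. The sample HTML below is made up for illustration; in practice you would feed in the action=parse response:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Accumulate text nodes, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

html = '<div class="mw-parser-output"><p>Hello <b>world</b>.</p></div>'
p = TextOnly()
p.feed(html)
print("".join(p.parts))  # → Hello world.
```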
Upvotes: 75
Reputation: 3836
I don't think it is possible using the API to get just the text.
What has worked for me was to request the HTML page (using the normal URL that you would use in a browser) and strip out the HTML tags under the content div.
EDIT:
I have had good results using HTML Parser for Java. It has examples of how to strip out HTML tags under a given DIV.
Upvotes: 6