user1647710

Reputation: 45

How to get Wikipedia page HTML with absolute URLs using the API?

I'm trying to retrieve articles through the Wikipedia API using this code:

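// Ask the parse API for the rendered HTML of the page "example" as JSON
$url = 'http://en.wikipedia.org/w/api.php?action=parse&page=example&format=json&prop=text';
$ch = curl_init($url);
// Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$c = curl_exec($ch);
$json = json_decode($c);
// The article HTML is stored under parse -> text -> "*"
$content = $json->parse->text->{'*'};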

I can display the content on my website and everything works, but I have a problem with the links inside the retrieved article. If you open the URL, you can see that all the links start with href=\"/, so when someone clicks a related link in the article they are sent to www.mysite.com/wiki/.. (error 404) instead of en.wikipedia.org/wiki/.. Is there any code I can add to the existing snippet to fix this?

Upvotes: 3

Views: 1934

Answers (3)

Paul

Reputation: 151

In case anyone else needs to replace every instance of the URL, not just the first one: use a regex with the g (global) flag, for example:

/<a href="\/w/g

Upvotes: 0

Ilmari Karonen

Reputation: 50328

This seems to be a shortcoming in the MediaWiki action=parse API. In fact, someone already filed a feature request asking for an option to make action=parse return full URLs.

As a workaround, you could either rewrite the links yourself (as Adil suggests), or request the page through index.php?action=render instead of api.php.
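A request along those lines might look like the sketch below. The helper names render_url and fetch_rendered_html are made up for illustration, and the User-Agent string is a placeholder you would replace with your own contact details; only index.php, title and action=render come from MediaWiki itself.

```php
<?php
// Hypothetical helper: build an index.php?action=render URL for a page title.
function render_url(string $title, string $host = 'en.wikipedia.org'): string
{
    return 'https://' . $host . '/w/index.php?' . http_build_query([
        'title'  => $title,
        'action' => 'render',
    ]);
}

// Hypothetical helper: fetch the rendered HTML, or null if the request fails.
function fetch_rendered_html(string $title): ?string
{
    $ch = curl_init(render_url($title));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Placeholder User-Agent (an assumption); use one that identifies your site.
    curl_setopt($ch, CURLOPT_USERAGENT, 'MySiteBot/0.1 (admin@example.com)');
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}

echo render_url('Example'), "\n";
// prints https://en.wikipedia.org/w/index.php?title=Example&action=render
```

Calling fetch_rendered_html('Example') should then return the article body directly as an HTML string, so there is no JSON to decode.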

This will only give you the page HTML with no API wrapper, but if that's all you want anyway then it should be fine. (For example, this is the method used internally by InstantCommons to show remote file description pages.)

Upvotes: 4

Adil

Reputation: 1038

You should be able to fix the links like this:

$content = str_replace('<a href="/w', '<a href="//en.wikipedia.org/w', $content);
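If you also want to catch image sources and links where href is not the first attribute, a regex variant could look something like this. It is only a sketch: absolutize_links is a hypothetical helper name, the https://en.wikipedia.org default is an assumption, and attributes such as srcset are not handled.

```php
<?php
// Hypothetical helper: prefix root-relative href/src values with a base URL.
function absolutize_links(string $html, string $base = 'https://en.wikipedia.org'): string
{
    // Match href="/..." or src="/...", but skip protocol-relative
    // values such as "//upload.wikimedia.org/..." via the (?!/) lookahead.
    $result = preg_replace('~\b(href|src)="/(?!/)~', '$1="' . $base . '/', $html);
    // preg_replace returns null on error; fall back to the untouched input.
    return $result ?? $html;
}

$sample = '<a href="/wiki/Example" title="Example">Example</a>';
echo absolutize_links($sample), "\n";
// prints <a href="https://en.wikipedia.org/wiki/Example" title="Example">Example</a>
```

You would run it once on $content before echoing it. Unlike JavaScript's replace, preg_replace replaces every match by default, so no global flag is needed.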

Upvotes: 4
