user1518659

Reputation: 2246

How to convert a Wikipedia URL in another language to its English equivalent in PHP?

I have a Wikipedia URL in some language other than English, for example:

http://ru.wikipedia.org/wiki/Liz_Claiborne,_Inc

I want to convert it to the English Wikipedia URL, i.e.:

http://en.wikipedia.org/wiki/Liz_Claiborne,_Inc

What is the most effective way to do this?

I tried searching for ".wikipedia" in the string and replacing the two characters before it with "en".

But what if the input has no language prefix at all, such as:

http://wikipedia.org/wiki/Liz_Claiborne,_Inc

How can I handle all of these cases?

I hope my question is clear. Any help would be appreciated.

Upvotes: 0

Views: 272

Answers (3)

Sverri M. Olsen

Reputation: 13283

This will either change an existing locale or add one if it is missing:

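$urls = array(
    'http://wikipedia.org',
    'http://ru.wikipedia.org',
    'http://en.wikipedia.org',
);
// Match an optional two-letter locale (or the empty position) right after the scheme.
$regex  = '/(?<=^http:\/\/|^https:\/\/)(?:[a-z]{2}\.|\b)(?=wikipedia\.org)/i';
$change = 'de'; // target locale; use 'en' for the English Wikipedia
echo '<pre>';
foreach ($urls as $url) {
    echo preg_replace($regex, "$change.", $url), "\n";
}
echo '</pre>';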

The problem with just changing the locale, however, is that you will get a lot of missing pages. The part of the URL that matters is the page title at the end, and it is different in most languages:

http://en.wikipedia.org/wiki/Internet
http://fo.wikipedia.org/wiki/Alnet
http://gv.wikipedia.org/wiki/Eddyr-voggyl

All those pages are about the "Internet", but none of them would be accessible by simply changing the locale.

Upvotes: 2

tmuguet

Reputation: 1165

The name of the page can vary depending on the language, so you cannot simply guess the URL.

The only approach that works for all pages is to parse the Wikipedia page and read the href value of the English entry among the "Other languages" links:

<li class="interwiki-en"><a href="__url__" title="__title__" hreflang="en" lang="en">English</a></li>
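As a rough illustration of that parsing step, here is a minimal sketch using PHP's DOM extension. It assumes the page still uses the `interwiki-en` class shown above, which may not hold for current Wikipedia markup. `findEnglishLink` is a hypothetical helper name, and fetching the page HTML is left out.

```php
<?php
// Sketch: extract the English interlanguage link from a Wikipedia page's HTML.
// Assumes an <li class="interwiki-en"> wrapping an <a>, as in the markup above.
// findEnglishLink is a hypothetical helper name, not part of any library.
function findEnglishLink(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely strictly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "interwiki-en" as a whole class token, even among other classes.
    $nodes = $xpath->query(
        '//li[contains(concat(" ", normalize-space(@class), " "), " interwiki-en ")]/a/@href'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null; // no English link found
    }
    return $nodes->item(0)->nodeValue;
}

// Usage with a static snippet instead of a downloaded page:
$html = '<ul><li class="interwiki-en"><a href="//en.wikipedia.org/wiki/Internet" '
      . 'title="Internet" hreflang="en" lang="en">English</a></li></ul>';
echo findEnglishLink($html), "\n";
```

The href may be protocol-relative (starting with `//`), so a scheme might need to be prepended before using it.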

Upvotes: 1

MrGlass

Reputation: 9262

I would use a regular expression to grab the substring you are looking for. A simple working example:

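<?php
// Capture everything from "wikipedia.org/" onward, ignoring any subdomain.
$regex = '@^https?://(?:[^/]+\.)?(wikipedia\.org/.+)$@i';
$url = 'http://ru.wikipedia.org/wiki/Liz_Claiborne,_Inc';
if (preg_match($regex, $url, $matches)) {
    echo 'http://en.' . $matches[1];
}
?>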
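If a regular expression feels fragile, the same host swap could be sketched with `parse_url` instead, which also covers URLs with no language prefix and longer language codes. `toWikipediaLanguage` is a hypothetical helper name. Like any locale swap, it does not check that the page exists under the same title.

```php
<?php
// Sketch: rewrite any wikipedia.org URL to a given language subdomain.
// toWikipediaLanguage is a hypothetical helper; it only swaps the host.
function toWikipediaLanguage(string $url, string $lang = 'en'): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // not a parseable absolute URL
    }
    // Accept wikipedia.org itself or any subdomain of it (ru., www., zh-yue., ...).
    if (!preg_match('/(?:^|\.)wikipedia\.org$/i', $parts['host'])) {
        return null; // some other site
    }
    $scheme   = $parts['scheme'] ?? 'http';
    $path     = $parts['path'] ?? '/';
    $query    = isset($parts['query']) ? '?' . $parts['query'] : '';
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $scheme . '://' . $lang . '.wikipedia.org' . $path . $query . $fragment;
}

echo toWikipediaLanguage('http://ru.wikipedia.org/wiki/Liz_Claiborne,_Inc'), "\n";
echo toWikipediaLanguage('http://wikipedia.org/wiki/Liz_Claiborne,_Inc'), "\n";
```

Both calls should print the same `http://en.wikipedia.org/...` URL. Non-Wikipedia hosts return `null` so the caller can decide how to handle them.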

Upvotes: 1
