422

Reputation: 5770

Strip characters from string

I am working on a script that fetches data from Wikipedia.

A common issue: say I want to fetch this value:

North Stradbroke Island

But the string we actually get back is the one below, so I need to remove the crap around it:

[[North Stradbroke Island]]'

Current scrape code is:

$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, "http://en.wikipedia.org/wiki/Special:Export/" . $wiki['suburb'] . ",_" . $wiki['state']);
curl_setopt($curl_handle, CURLOPT_TIMEOUT, 10);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($curl_handle);
curl_close($curl_handle);

$x = simplexml_load_string($xml);
$text   = $x->page->revision->text;

$arr = explode("| ", $text);

$wikipedia = array();
foreach($arr as $s){
    $pair   = preg_split('/= /', $s);
    $key    = substr($pair[0],0,strpos($pair[0]," "));
    switch($key){
        case "lga":
        case "pop":
        case "dist1":
            $wikipedia[$key] = substr($pair[1],0,-1);
            break;
        case "near-nw":
        case "near-n":
        case "near-ne":
        case "near-w":
        case "near-e":
        case "near-sw":
        case "near-s":
        case "near-se":
            $value = $pair[1];
            if($value != ""){
                $value =substr($pair[1],2,strpos($pair[1],",")-2);
            }
            $wikipedia[$key] = $value;
            break;
    }
}

On my page I have:

<?php
$wiki['suburb'] = str_replace(" ", "_", $r['suburb']);
$wiki['state'] = convertStateWiki($r['state']);
include("/path-to-wiki-file/wiki.suburb.php");
if ($wikipedia != NULL){
?>

And to echo the results (example):

<a href="reviews/<?=strtolower($r['state']);?>/<?=strtolower(str_replace(" ", "-", $wikipedia['near-nw']));?>/"><?=$wikipedia['near-nw'];?></a>

So essentially: we grab a suburb using Wikipedia's export feed. That suburb may have been typed into Wikipedia like this:

[['Some Suburb Name]'] for example

I need to return the above as: Some Suburb Name

We need to strip all non-alpha characters. I'm not 100% with PHP, so if this sounds dumb, please feel free to say so. But please don't vote down, as I have provided as much code as possible.

I just need to stop the returned data from including anything but alpha characters (spaces must be allowed).

Upvotes: 0

Views: 320

Answers (3)

pdlol

Reputation: 389

Here you go:

<?php
$place = $wikipedia['near-nw'];
$place = trim($place, "[]'");
$href = str_replace(" ", "-", $place);
?>
<a href="reviews/<?=strtolower($r['state'] . "/" . $href);?>/"><?=$place;?></a>
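
If the same cleanup is wanted for every fetched field rather than only `near-nw`, one option is to run it over the whole `$wikipedia` array before any output. This is only a sketch: it assumes the array shape built by the scraper in the question, the sample values are made up, and `cleanWikiValue` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: strips wiki link brackets, stray quotes and
// surrounding whitespace from the ends of one value.
function cleanWikiValue($value)
{
    return trim((string) $value, "[]' \t\n\r");
}

// Sample data standing in for the array built by the scraper (assumed).
$wikipedia = array(
    'near-nw' => "[[North Stradbroke Island]]'",
    'near-n'  => '',
);

// Clean every value in one pass. With a single input array,
// array_map() keeps the original string keys.
$wikipedia = array_map('cleanWikiValue', $wikipedia);

echo $wikipedia['near-nw'], "\n";
```

Cleaning once, right after the scrape, means the link text and the URL slug are both built from the same already-clean value.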

Upvotes: 1

Ja͢ck

Reputation: 173602

Wiki Markup is actually pretty well documented.

However, for your case, a simple trim($str, "[]'") should do it :)

In your case:

$wiki['suburb'] = str_replace(" ", "_", trim($r['suburb'], "[]'"));
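
It may help to see what `trim()` with a character list does and does not touch: it removes the listed characters from both ends of the string only, so anything in the middle stays. A small sketch, where the third sample value is made up to show an interior apostrophe surviving:

```php
<?php
// trim() with a character list strips those characters from the
// start and end of the string only, never from the middle.
$samples = array(
    "[[North Stradbroke Island]]'",
    "[['Some Suburb Name]']",
    "[[O'Connor]]",              // made-up value with an interior apostrophe
);

$cleaned = array();
foreach ($samples as $raw) {
    $cleaned[] = trim($raw, "[]'");
}

// Prints one cleaned name per line.
echo implode("\n", $cleaned), "\n";
```

That is the main difference from stripping every non-alpha character, which would also drop the apostrophe inside a name.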

Upvotes: 1

valentinas

Reputation: 4337

"need to remove the crap" that crap is called Wiki Markup and it is machine readable. Here's a list of parsers: http://www.mediawiki.org/wiki/Alternative_parsers

If you strip all non-alphanumeric characters, you will lose a lot of information. Instead, parse the markup and then output it in whatever format you like.
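
To illustrate the information loss, consider a piped link of the form `[[Target|Label]]`. The sketch below is not a full parser, only a single regular expression for simple links; `wikiLinkText` is a hypothetical helper name and the sample value is made up.

```php
<?php
// Hypothetical helper: returns the display text of the first [[...]] link
// in $markup, or the end-trimmed input when no well-formed link is found.
function wikiLinkText($markup)
{
    // Matches [[Target]] or [[Target|Label]]; group 2 is the optional label.
    if (preg_match('/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/', $markup, $m)) {
        return trim(isset($m[2]) ? $m[2] : $m[1]);
    }
    return trim($markup, "[]' \t\n\r");
}

$raw = "[[Redland City|Redland]]";  // made-up sample value

// Blanket stripping glues the link target and the label together.
$stripped = preg_replace('/[^A-Za-z ]/', '', $raw);

// Reading the link structure keeps only the label.
$label = wikiLinkText($raw);

echo $stripped, "\n", $label, "\n";
```

For anything beyond simple links (templates, nested markup, references), one of the parsers from the list above is likely the safer route.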

Upvotes: -1
