mishraoft

Reputation: 113

How to scrape Hindi text from the web using PHP

I am trying to scrape data from a web page (the URL is in the script below) whose text is in Hindi, but I am getting a response like this:

\u093f\u0938\

How do I decode this Unicode? Please suggest what I should change in my PHP script.

This script works correctly with English text, so what is going wrong with Hindi? I have already scraped data with this script. I know this response is Devanagari Unicode, but how do I decode it?

I am new to PHP. Thanks in advance.

$i= 1;
for($i; $i < 6; $i++)
{
    $html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $nodes = $dom->getElementsByTagName('p');
    $item = array();
    $articles = array();
    foreach ($nodes as $node) {
         $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
         $item['cat_id'] = 1;
         if($item['msg'] !="")
         $articles[] = array_unique($item);
    }
    $articles = json_encode($articles);
    print_r($articles);
}

Upvotes: 1

Views: 1298

Answers (3)

NilsB

Reputation: 1188

PHPhil's answer is good and I upvoted it. However, running only the PHP part is not enough: the page also needs the right meta tag (see the code below) so that the browser displays the Devanagari text properly. I also wanted to correct the missing "=" in the $html assignment of the originally posted code. Unfortunately my edit was rejected, so I am posting the corrected code as a new answer.

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<?php

$i= 1;
for($i; $i < 6; $i++)
{
    $html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $nodes = $dom->getElementsByTagName('p');
    $item = array();
    $articles = array();
    foreach ($nodes as $node) {
         $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
         $item['cat_id'] = 1;
         if($item['msg'] !="")
         $articles[] = array_unique($item);
    }
    $articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
    print_r($articles);
}
?>
</body>
</html>
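
As a possible complement to the meta tag, the encoding can also be declared in the HTTP response header. The following is only a minimal sketch: the sample text is a hypothetical stand-in for a scraped paragraph, and it assumes nothing has been sent to the browser before header() is called.

```php
<?php
// Declare the encoding in the HTTP response header. header() must be
// called before any output, hence the headers_sent() guard.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical sample text standing in for a scraped paragraph.
$text = 'हिन्दी';

// Escape HTML special characters while treating the input as UTF-8.
$paragraph = '<p>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</p>';
echo $paragraph, "\n";
```

Escaping with htmlspecialchars() is a precaution for scraped text, which may contain characters such as < or & that would otherwise be interpreted as markup.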

Upvotes: 0

NilsB

Reputation: 1188

You are very close. The response contains the characters ि (U+093F) and स (U+0938), written as \u escape sequences.

First, you can search Google for each escape sequence to find the Devanagari character it stands for:

https://www.google.de/#q=%5Cu093f

https://www.google.de/#q=%5Cu0938

If you want to show these Unicode characters in HTML, you can change the notation from \u0123 to &#x123; (a numeric character reference, including the trailing semicolon). See here:

<html>
<body>
<p>These are two chars in Devanagari &#x93f;&#x938;</p>
</body>
</html>

But since you want to scrape Hindi, you should start learning how to read and handle Unicode. The next question is how you want to process your result.
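
The conversion described above can be sketched in PHP roughly as follows. This is only an illustration: the sample string is taken from the two complete escapes in the question's output, and the regular expression assumes characters in the Basic Multilingual Plane (no surrogate pairs), which covers Devanagari.

```php
<?php
// Sample input: the two complete escapes from the question's output.
// Single quotes keep the backslashes literal.
$escaped = '\u093f\u0938';

// json_decode() understands \uXXXX escapes inside a JSON string literal,
// so wrapping the text in double quotes turns it back into UTF-8.
$utf8 = json_decode('"' . $escaped . '"');

// Alternative for HTML output: rewrite each \uXXXX escape as a numeric
// character reference. This simple pattern does not handle surrogate pairs.
$entities = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $escaped);

echo $utf8, "\n";
echo $entities, "\n";
```

The first variant gives a normal UTF-8 string that can be stored or processed further; the second is only useful when the text is written straight into an HTML page.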

Upvotes: 0

PHPhil

Reputation: 1540

If you are running PHP 5.4 or greater, pass the JSON_UNESCAPED_UNICODE flag as the second argument when calling json_encode.

$i= 1;
for($i; $i < 6; $i++)
{
    $html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $nodes = $dom->getElementsByTagName('p');
    $item = array();
    $articles = array();
    foreach ($nodes as $node) {
         $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
         $item['cat_id'] = 1;
         if($item['msg'] !="")
         $articles[] = array_unique($item);
    }
    $articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
    print_r($articles);
}
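
To see what the flag changes in isolation, here is a small sketch that does not touch the network. The sample record is hypothetical and only shaped like the entries built in the loop above.

```php
<?php
// Hypothetical sample record, shaped like the entries built in the loop.
$article = array('msg' => 'हिन्दी', 'cat_id' => 1);

// Default: every non-ASCII character is written as a \uXXXX escape.
$escaped = json_encode($article);

// With the flag (PHP 5.4+): the UTF-8 characters are kept as they are.
$readable = json_encode($article, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";
echo $readable, "\n";

// Both strings describe the same data, so decoding either one
// gives back the same array.
var_dump(json_decode($escaped, true) === json_decode($readable, true));
```

The escaped form is still valid JSON and any JSON parser will restore the Hindi text from it, so the flag mainly matters when the output is meant to be read by a person or shown directly.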

Upvotes: 1
