PSilvestre

Reputation: 177

Trying to extract keywords from a website PHP (OOP)

I still have the keyword problem, but here is the code I'm writing.

It's poor code, but it's my own:

<?php
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) { 
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del"); 

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTMLFile($url);
    $webhtml = $doc->getElementsByTagName('p');
    $webhtml = $webhtml ->item(0)->nodeValue;

    $webhtml = strip_tags($webhtml);
    $webhtml = explode(" ", $webhtml);

    foreach($listanegra as $key=> $ln) {
    $webhtml = str_replace($ln, " ", $webhtml);
    }
    $palabras = str_word_count ("$webhtml", 1 ); 
    $frq = array_count_values ($palabras); 
    $frq = asort($frq);
    $ffrq = count($frq);
$i=1;
while ($i < $ffrq) {
    print $frqq[$i];
    print '<br />';
    $i++;
}
}
?>

The code tries to extract keywords from a website. It takes the first paragraph of the page and removes the words listed in the variable "$listanegra". Next, it counts the repeated words and stores them all in an array. Then I read the array, which should show me the words.

The problem is that the code doesn't work =(.

When I run it, it shows a blank page.

Could you help me finish my code? I was advised to use "tf-idf", but I will use it later.

Upvotes: 1

Views: 1695

Answers (2)

Your server should show errors while you are testing. Add this at the top of the script, right after the opening <?php tag:

ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL);

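That way you will see the error: "Array to string conversion" on line 24 (line 19 if you don't add the 5 new lines). It comes from putting $webhtml, which is an array after explode(), inside a string in the str_word_count call.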

Here are some errors I found. Four functions are not used as they should be: str_replace, str_word_count, asort and array_count_values.

Using str_replace is a little tricky. It matches substrings, so trying to find and remove "a" removes every "a" in the text, even inside "animal" (str_replace("a", "", "animal") => "niml"). This link should be useful: link
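
To make the substring problem concrete, here is a minimal sketch comparing str_replace with a word-boundary regex. The sample sentence and the two-word stop list are made up for illustration; they are not taken from the Wikipedia page.

```php
<?php
// Hypothetical sample text and stop-word list, for illustration only.
$text = 'el animal va a casa';
$stopWords = array('a', 'el');

// str_replace() matches substrings, so it also removes letters inside words.
$naive = str_replace('a', '', $text);

// Whole-word removal: \b anchors the match at word boundaries, and
// preg_quote() escapes any regex metacharacters in the stop word.
$clean = $text;
foreach ($stopWords as $stop) {
    $clean = preg_replace('/\b' . preg_quote($stop, '/') . '\b/u', '', $clean);
}

// Collapse the gaps left behind by the removed words.
$clean = trim(preg_replace('/\s+/', ' ', $clean));

echo $naive, "\n"; // el niml v  cs
echo $clean, "\n"; // animal va casa
```

With str_replace the word "animal" is damaged, while the regex version only drops the standalone "a" and "el".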

asort returns true or false, so $frq = asort($frq) overwrites the array with that boolean. Doing just:

asort($frq);

sorts the array in place by value, keeping the keys. $frq holds the result of array_count_values, i.e. $frq = array($word1 => word1_count, ...): the keys are the words and the values are the number of times each word is used. So when you later have:

 print $frq[$i]; // you have print $frqq[$i]; in your code

the result will be empty, since the keys of this array are the words and the values are the number of times each word appears in the text; there are no numeric indexes to read.
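
A small sketch of how these two functions behave, using a made-up word list rather than the real page text:

```php
<?php
// Hypothetical word list, for illustration only.
$words = array('animal', 'reino', 'animal', 'celula', 'animal', 'reino');

// Keys are the words, values are how often each one occurs.
$frq = array_count_values($words);

// asort() sorts $frq in place by count (ascending) and keeps the keys.
// Its return value is a boolean, so do not assign it back to $frq.
$ok = asort($frq);

// The keys are strings, so iterate with foreach instead of a numeric index.
foreach ($frq as $word => $count) {
    echo $word, ': ', $count, "\n";
}
```

This prints celula, reino and animal with counts 1, 2 and 3, in that order.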

You must also be really careful with str_word_count. Since you are reading Spanish text, and the text can contain numbers, you should pass the extra characters explicitly:

str_word_count($string,1,'áéíóúüñ1234567890');
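
As a rough illustration of what the third argument changes, the sketch below uses an invented ASCII sentence with digits only. How accented characters behave depends on the encoding and locale, so they are deliberately left out here.

```php
<?php
// Hypothetical sample sentence, for illustration only.
$text = 'hay 35 filos y 1300000 especies';

// By default digits are not treated as word characters, so numbers are dropped.
$default = str_word_count($text, 1);

// Passing the digits in the third argument keeps them as words.
$withDigits = str_word_count($text, 1, '0123456789');

print_r($default);    // hay, filos, y, especies
print_r($withDigits); // hay, 35, filos, y, 1300000, especies
```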

The code I would suggest:

<?php
header('Content-Type: text/html; charset=UTF-8');
ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL);



$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) { 
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del"); 

$html=file_get_contents($url);

    $doc = new DOMDocument('1.0', 'UTF-8');
    $html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"); 
    libxml_use_internal_errors(true);
     $doc->loadHTML($html);
    $webhtml = $doc->getElementsByTagName('p');


    $webhtml = $webhtml ->item(0)->nodeValue;

    $webhtml = strip_tags($webhtml);
    print_r ($webhtml);
    $webhtml = explode(" ", $webhtml);


    // str_replace() also accepts an array of search terms, but it matches
    // substrings, so use a word-boundary regex for each stop word instead.
    foreach ($listanegra as $ln) {
        $webhtml = preg_replace('/\b' . preg_quote($ln, '/') . '\b/u', ' ', $webhtml);
    }

    $palabras = str_word_count(implode(" ",$webhtml), 1, 'áéíóúüñ1234567890');

    sort($palabras);

    $frq = array_count_values ($palabras);


foreach($frq as $index=>$value) {
    print "the word <strong>$index</strong>  was used <strong>$value</strong> times";
    print '<br />';

}
}
?>

It was really painful trying to figure out the special character issues.

Upvotes: 0

kittycat

Reputation: 15044

I do believe this is what you were trying to do:

$url = 'http://es.wikipedia.org/wiki/Animalia';

$words = Keys($url);

/// do your database stuff with $words


function Keys($url)
{
    $listanegra = array('a', 'ante', 'bajo', 'con', 'contra', 'de', 'desde', 'mediante', 'durante', 'hasta', 'hacia', 'para', 'por', 'que', 'qué', 'cuán', 'cuan', 'los', 'las', 'una', 'unos', 'unas', 'donde', 'dónde', 'como', 'cómo', 'cuando', 'porque', 'por', 'para', 'según', 'sin', 'tras', 'con', 'mas', 'más', 'pero', 'del');

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTMLFile($url);
    $webhtml = $doc->getElementsByTagName('p');
    $webhtml = $webhtml->item(0)->nodeValue;
    $webhtml = strip_tags($webhtml);
    $webhtml = explode(' ', $webhtml);

    $palabras = array();
    foreach($webhtml as $word)
    {
        $word = strtolower(trim($word, ' .,!?()')); // strip leading/trailing punctuation and spaces
        if (!in_array($word, $listanegra))
        {
            $palabras[] = $word;
        }
    }
    $frq = array_count_values($palabras);
    asort($frq);
    return implode(' ', array_keys($frq));
}
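
Note that asort puts the least frequent words first. If you want the most frequent words at the front, one possible variation is sketched below. The helper name topKeywords and its $limit parameter are hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: returns the $limit most frequent words.
function topKeywords(array $palabras, $limit)
{
    $frq = array_count_values($palabras);
    arsort($frq); // highest count first, keys (the words) preserved
    return array_slice(array_keys($frq), 0, $limit);
}

// Made-up word list, for illustration only.
$top = topKeywords(array('reino', 'animal', 'animal', 'celula', 'animal', 'reino'), 2);
echo implode(' ', $top), "\n"; // animal reino
```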

Upvotes: 1
