AdityaDees

Reputation: 1047

Using a PHP web-crawler to find certain words without certain elements

I'm following http://simplehtmldom.sourceforge.net/ to build a web crawler in PHP, but I'm confused about how to search for words without specifying an element, so that the search covers all of the page's data. The problem is that I currently restrict the search to <p> elements, so when the word is not inside a <p> element the result is empty.

This is my code:

<?php
include "simple_html_dom.php";
$html = file_get_html('https://adityadees.blogspot.com/');

foreach($html->find('<p>') as $element) 
if (strpos($element, 'yang') !== false) {
    echo $element;
} else {
    echo $element;
}
?>

For example, when I search for the word 'yang', the result is empty because the text containing it is not inside a <p> element. When the word is inside a <p> element, the search works as expected.

I tried changing this line

foreach($html->find('<p>') as $element) 

to

foreach($html->find() as $element) 

but I got this error:

Fatal error: Uncaught ArgumentCountError: Too few arguments to function simple_html_dom::find(), 0 passed in C:\xampp\htdocs\crawl\index.php on line 5 and at least 1 expected in C:\xampp\htdocs\crawl\simple_html_dom.php:1975 Stack trace: #0 C:\xampp\htdocs\crawl\index.php(5): simple_html_dom->find() #1 {main} thrown in C:\xampp\htdocs\crawl\simple_html_dom.php on line 1975

Upvotes: 1

Views: 2993

Answers (3)

user11222393

Reputation: 5471

Do you want to find all paragraphs or text that contain your given word?

<?php 
include('simple_html_dom.php');

$html = file_get_html('https://adityadees.blogspot.com/');

$strings_array = array();

// Match any (*) element whose text contains "yang"
foreach($html->find('*[plaintext*=yang]') as $element) {
    // Keep only elements without child nodes, i.e. the innermost ones
    if ($element->firstChild() == null) {
        // Duplicate strings still occur, so store only unique values
        if (!in_array($element->innertext, $strings_array)) {
            $strings_array[] = $element->innertext;
        }
    }
}

echo '<pre>';
print_r($strings_array);
echo '</pre>';

?>

This isn't a final solution, but it is something to start with. At least it finds the word yang 61 times, the same number as in the HTML source of the given page.
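
As a point of comparison, the same tag-agnostic idea can be sketched with PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom. This is a swapped-in technique, not the library used above, and the function name findTextContaining and the sample markup are hypothetical, made up for illustration. It walks every text node, so no element name has to be given:

```php
<?php
/**
 * Return the trimmed text of every text node that contains $word,
 * regardless of which element wraps it. Hypothetical helper.
 */
function findTextContaining(string $html, string $word): array
{
    // Real-world markup is often invalid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $matches = [];
    // Visit every text node that is not inside <script> or <style>.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '' && strpos($text, $word) !== false) {
            $matches[] = $text;
        }
    }
    // Drop duplicate strings and reindex the result.
    return array_values(array_unique($matches));
}

// Hypothetical sample markup standing in for the downloaded page.
$sample = '<div class="item-snippet">itulah yang selalu dipertanyakan</div>'
        . '<span>tanpa kata itu</span>'
        . '<script>var yang = 1;</script>';

print_r(findTextContaining($sample, 'yang'));
```

For a live page, the string passed in could come from file_get_contents() on the URL, assuming the server allows it. Working on text nodes avoids the parent/child duplication that the firstChild() check above is filtering out.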

Upvotes: 1

user11222393

Reputation: 5471

If you inspect the source of the given page, you can see that each post summary is inside a div tag with the class item-snippet.

<div class='item-snippet'> Bagaimana Cara Mengganti Akun Mobile Legend ?  itulah yang selalu dipertanyakan oleh orang yang baru memulai bermain game Mobile Legend.  S...</div>

You can get your result by searching for your word in those divs:

include('simple_html_dom.php');

$html = file_get_html('https://adityadees.blogspot.com/');

foreach($html->find('div[class=item-snippet]') as $element) {
    if (strpos($element, 'yang') !== false) {
        echo $element;
    }
}

Result:

Bagaimana Cara Mengganti Akun Mobile Legend ? itulah yang selalu dipertanyakan oleh orang yang baru memulai bermain game Mobile Legend. S...
Bagaimana Cara Mengaitkan Akun Mobile Legend di Patch Baru ? Mungkin masih ada yang bingung tentang cara mengaitkan akun mobile legend den...
Kali ini kita akan membahas tentang bagaimana cara menghitung luas persegi panjangan dengan PHP Hal yang pertama dilakukan adalah membuat ...

Is this what you are looking for?

Upvotes: 0

Rillus

Reputation: 1278

How about:

foreach($html->find('<body>') as $element) 
if (strpos($element, 'yang') !== false) {
    echo $element;
} else {
    echo $element;
}
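
Since both branches above echo $element, the whole body is printed whether or not the word is found. If the goal is only to check whether, and how often, the word appears anywhere on the page, a minimal sketch with PHP's built-in strip_tags and preg_match_all might look like the following. The function name countWord and the sample markup are hypothetical, and this does not use simple_html_dom at all:

```php
<?php
/**
 * Count whole-word, case-insensitive occurrences of $word in the text of
 * $html, ignoring which element the text sits in. Hypothetical helper.
 */
function countWord(string $html, string $word): int
{
    // Remove all tags so only text content remains.
    // Caveat: strip_tags keeps the text inside <script> and <style>.
    $text = strip_tags($html);
    // \b keeps "yang" from matching inside longer words such as "wayang".
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    return (int) preg_match_all($pattern, $text);
}

// Hypothetical sample markup standing in for the downloaded page.
$sample = '<div>itulah yang selalu dipertanyakan oleh orang yang baru</div>'
        . '<p>Hal yang pertama, bukan wayang</p>';

echo countWord($sample, 'yang'), PHP_EOL;
```

A count greater than zero means the word occurs somewhere in the page text, with no selector involved.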

Upvotes: 0
