robert0

Reputation: 445

Simple PHP code for extracting data from the HTML source code

I know I could use XPath, but it wouldn't work in this case because the site's navigation is too complex.

I can only use the source code.

I have searched all over and couldn't find a simple PHP solution that would:

  1. Open the HTML source code page (I already have an exact source code page URL).
  2. Select and extract the text between two known pieces of HTML code. The text is not wrapped in a div, but I know the start and end strings.

So, basically, I need to extract the text between

knownhtmlcodestart> Text to extract <knownhtmlcodeend

What I'm trying to achieve in the end is this:

  1. Go to a source code URL.
  2. Extract the text between two codes.
  3. Store the data temporarily in a simple text file on my web server (I want to set manually how long it is kept).
  4. Define the waiting time and then repeat the whole process again.

The website I'm going to extract data from changes dynamically, so the script would always store new data in the same file.
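
The four steps above could be sketched roughly as follows. This is only a minimal sketch under assumptions: `cachedExtract`, the cache file path, and the `$fetch` callable are hypothetical names introduced for illustration, and the "waiting time" is modelled as a maximum age in seconds for the text file.

```php
<?php
// Hypothetical helper. $fetch is any callable returning the page source,
// e.g. function () use ($url) { return file_get_contents($url); }.
function cachedExtract(string $cacheFile, int $ttl, callable $fetch, string $start, string $end): ?string
{
    clearstatcache(true, $cacheFile);

    // Steps 3-4: reuse the stored text until it is older than $ttl seconds.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Step 1: get the page source (a failed fetch becomes an empty string).
    $html = (string) $fetch();

    // Step 2: cut out the text between the two markers.
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    if ($to === false) {
        return null;
    }
    $text = substr($html, $from, $to - $from);

    // Step 3: overwrite the same text file with the new data.
    file_put_contents($cacheFile, $text, LOCK_EX);
    return $text;
}
```

Calling such a function from a cron job, or on each page request, with `$ttl` set to the waiting time would cover the "repeat" step without a long-running script.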

Then I would use that data (but that's a question for another time).

I would appreciate it if anyone could lead me to a simple solution.

I'm not asking anyone to write the code for me, but if someone has done something similar, sharing that code here would be helpful.

Thanks

Upvotes: 1

Views: 820

Answers (2)

Phaelax z

Reputation: 2009

This assumes the opening and closing tags are on the same line (as in your example). If the tags can be on separate lines, it wouldn't be difficult to adapt this.

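// Placeholder URL; file_get_contents() needs the scheme (http://) for remote pages.
$html = file_get_contents('http://website.com');

$start = "knownhtmlcodestart";
$end   = "knownhtmlcodeend";
$text  = '';

foreach (explode("\n", $html) as $line) {
    // strpos() returns false when nothing is found and 0 for a match at the
    // very start of the line, so compare with === false instead of if ($t1).
    $t1 = strpos($line, $start);
    if ($t1 === false) {
        continue;
    }
    $t1 += strlen($start);            // skip past the start marker itself
    $t2 = strpos($line, $end, $t1);   // look for the end marker after it
    if ($t2 !== false) {
        $text = substr($line, $t1, $t2 - $t1);   // substr(), not substring()
        break;
    }
}

echo $text;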
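
If the markers can sit on different lines, one possible adaptation is to skip the line split and search the whole document at once. A minimal sketch: `extractBetween` is a hypothetical helper name, and the sample string stands in for the fetched page.

```php
<?php
// Hypothetical helper (not part of the answer above): searches the whole
// document, so the two markers may be separated by line breaks.
function extractBetween(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;                      // start marker missing
    }
    $from += strlen($start);              // skip past the start marker
    $to = strpos($html, $end, $from);
    if ($to === false) {
        return null;                      // end marker missing
    }
    return substr($html, $from, $to - $from);
}

// Sample input, made up for illustration; the markers span two lines.
$html = "<p>knownhtmlcodestart> Text\nto extract <knownhtmlcodeend</p>";
$text = extractBetween($html, 'knownhtmlcodestart>', '<knownhtmlcodeend');
echo $text === null ? 'markers not found' : trim($text);
```

Returning `null` rather than an empty string keeps "markers not found" distinguishable from "markers found with nothing between them".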

Upvotes: 1

Eriks Klotins

Reputation: 4180

I (shamefully) found the following function useful for extracting stuff from HTML. Regexes are sometimes too complex for extracting large chunks, e.g. a whole <table>.

/*
   $start - string marking the start of the sequence you want to extract
   $end - string marking the end of it; null means "up to the end of $str"
   $offset - starting position in case you need to find multiple occurrences
   returns an array with the string between `$start` and `$end` and the
   indexes of its start and end, or false if either marker is not found
*/
function strExt($str, $start, $end = null, $offset = 0)
{
    $p1 = mb_strpos($str, $start, $offset);
    if ($p1 === false) return false;
    $p1 += mb_strlen($start);

    // Search from $p1 (not $p1+1) so an end marker directly after $start is found.
    $p2 = $end === null ? mb_strlen($str) : mb_strpos($str, $end, $p1);
    if ($p2 === false) return false;

    return [
        'str'   => mb_substr($str, $p1, $p2 - $p1),
        'start' => $p1,
        'end'   => $p2,
    ];
}
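
A possible way to use the `$offset` parameter for multiple occurrences is to feed the returned `end` index back into the next call. This is a sketch under assumptions: the sample HTML is made up, the `mbstring` extension is assumed to be available, and the function is repeated here (searching for `$end` from `$p1` and returning false when it is missing) only so the example runs on its own.

```php
<?php
// Copy of strExt so this example is self-contained; the guard avoids a
// redeclaration error if the function is already defined elsewhere.
if (!function_exists('strExt')) {
    function strExt($str, $start, $end = null, $offset = 0)
    {
        $p1 = mb_strpos($str, $start, $offset);
        if ($p1 === false) return false;
        $p1 += mb_strlen($start);
        $p2 = $end === null ? mb_strlen($str) : mb_strpos($str, $end, $p1);
        if ($p2 === false) return false;
        return ['str' => mb_substr($str, $p1, $p2 - $p1), 'start' => $p1, 'end' => $p2];
    }
}

// Sample input, made up for illustration.
$html   = '<td>one</td><td>two</td><td>three</td>';
$cells  = [];
$offset = 0;

// Keep extracting until no further start/end pair is found.
while (($m = strExt($html, '<td>', '</td>', $offset)) !== false) {
    $cells[] = $m['str'];
    $offset  = $m['end'] + mb_strlen('</td>');   // continue after this match
}

print_r($cells);
```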

Upvotes: 1
