Reputation: 445
I know I can use xpath, but in this case it wouldn't work because of the complexity of the navigation of the site.
I can only use the source code.
I have searched all over the place and couldn't find a simple PHP solution that would:
So, basically, I need to extract the text between
knownhtmlcodestart> Text to extract <knownhtmlcodeend
What I'm trying to achieve in the end is this:
The website that I'm going to extract data from is changing dynamically. So it would always store new data into the same file.
Then I would use that data (but that's a question for another time).
I would appreciate it if anyone could lead me to a simple solution.
I'm not asking anyone to write the code for me, but if someone has done something similar, sharing that code here would be helpful.
Thanks
Upvotes: 1
Views: 820
Reputation: 2009
This assumes the opening and closing tags are on the same line (as in your example). If the tags can be on separate lines, it wouldn't be difficult to adapt this.
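// file_get_contents needs a full URL (with scheme) to fetch a remote page
$html = file_get_contents('http://website.com');
$start = "knownhtmlcodestart";
$end = "knownhtmlcodeend";
$text = '';
foreach (explode("\n", $html) as $line) {
    $t1 = strpos($line, $start);
    if ($t1 === false) {
        continue; // strpos returns false when not found; 0 is a valid position
    }
    $from = $t1 + strlen($start); // skip past the start marker itself
    $t2 = strpos($line, $end, $from);
    if ($t2 !== false) {
        // substr (not substring) is the PHP function
        $text = substr($line, $from, $t2 - $from);
        break;
    }
}
echo $text;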
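If the markers can end up on different lines, one way to adapt this is to skip the line split and search the whole page as a single string. Below is a minimal sketch under that assumption; the helper name `extractBetween` and the sample `$html` string are made up for illustration, and in practice `$html` would come from `file_get_contents`.

```php
<?php
// Hypothetical helper: returns the text between two markers,
// or null if either marker is missing.
function extractBetween(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start); // move past the start marker
    $to = strpos($html, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Sample input standing in for the downloaded page (an assumption).
$html = "<div>\nknownhtmlcodestart\nText to extract\nknownhtmlcodeend\n</div>";
$text = extractBetween($html, "knownhtmlcodestart", "knownhtmlcodeend");
echo $text === null ? "not found" : trim($text);
```

The `trim` call only strips the newlines around the extracted text; drop it if whitespace matters.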
Upvotes: 1
Reputation: 4180
I (shamefully) found the following function useful for extracting stuff from HTML. Regexes are sometimes too complex for extracting large chunks, e.g. a whole <table>.
/*
$start - string marking the start of the sequence you want to extract
$end - string marking the end of it (null means "up to the end of $str")
$offset - starting position, in case you need to find multiple occurrences
returns the string between `$start` and `$end`, and the indexes of start and end,
or false if either marker is not found
*/
function strExt($str, $start, $end = null, $offset = 0)
{
    $p1 = mb_strpos($str, $start, $offset);
    if ($p1 === false) return false;
    $p1 += mb_strlen($start);
    // search from $p1 (not $p1 + 1) so an empty match right after $start is still found
    $p2 = $end === null ? mb_strlen($str) : mb_strpos($str, $end, $p1);
    if ($p2 === false) return false;
    return [
        'str' => mb_substr($str, $p1, $p2 - $p1),
        'start' => $p1,
        'end' => $p2,
    ];
}
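To show what the `$offset` idea is good for, here is a rough sketch of collecting every occurrence by restarting the search after each end marker. The helper `extractAll` and the sample row are hypothetical. The sketch uses the byte-based `strpos`/`substr` so it runs without the mbstring extension; the indexes stay consistent because all three calls count bytes, and the `mb_` variants can be swapped in if character positions are needed.

```php
<?php
// Hypothetical helper: collects every substring found between $start and $end.
function extractAll(string $str, string $start, string $end): array
{
    $found = [];
    $offset = 0;
    while (($p1 = strpos($str, $start, $offset)) !== false) {
        $p1 += strlen($start); // first byte after the start marker
        $p2 = strpos($str, $end, $p1);
        if ($p2 === false) {
            break; // start marker without a matching end: stop looking
        }
        $found[] = substr($str, $p1, $p2 - $p1);
        $offset = $p2 + strlen($end); // resume after the end marker
    }
    return $found;
}

// Sample input (an assumption for illustration).
$row = "<tr><td>one</td><td>two</td><td></td></tr>";
print_r(extractAll($row, "<td>", "</td>"));
```

For the sample row this collects "one", "two" and an empty string for the empty cell.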
Upvotes: 1