I am creating a PHP script to scrape the images and their respective dimension recommendations from
After extracting each image path and its suggested new width and height, I will programmatically optimize my images.
The following is the relevant portion of the HTML returned from the URL:
<tr class="rules-details" style="display: none">
<td colspan="4">
<a href="/serve-scaled-images.html" class="rule-help btn hover-tooltip" data-tooltip-interactive data-tooltip-max-width="450" title="<h4>Serve scaled images</h4><p>Serving appropriately-sized images can save many bytes of data and improve the performance of your webpage, especially on low-powered (eg. mobile) devices.</p><p class="rule-help-tooltip-more"><a href="/serve-scaled-images.html">Read more</a></p>"><i class="sprite-question"></i><span class="resp-hidden">What's this mean?</span></a>
<p>The following images are resized in HTML or CSS. Serving scaled images could save 1.3MiB (45% reduction).
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x623 to 123x200. Serving a scaled image could save 51.3KiB (86% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x578 to 135x200. Serving a scaled image could save 44.0KiB (84% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x390 to 176x200. Serving a scaled image could save 43.2KiB (69% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x436 to 174x200. Serving a scaled image could save 35.0KiB (73% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 31.4KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.9KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).</li>
After advice from John Conde to use a DOM parser, here is my coding attempt:
$html = file_get_contents('');
$document = new DOMDocument();
@$document->loadHTML($html); // Parse the fetched HTML (warnings from malformed markup are suppressed)
$xpath = new DOMXpath($document);
$stack = array();
$expression = './/tr[contains(concat(" ", normalize-space(@class), " "), " rules-details ")]';
foreach ($xpath->evaluate($expression) as $tr) {
    array_push($stack, $tr->nodeValue);
}
foreach ($stack as $string) {
    // Only process rows that report a size reduction
    if (strpos($string, 'reduction') === false) {
        continue;
    }
    $string = str_replace("What's this mean?", "", $string);
    $string = trim(preg_replace("/\s+/", " ", $string));
    $string_array = explode(').', $string);
    $array_size = sizeof($string_array);
    for ($i = 0; $i < $array_size; $i++) {
        $search_string = $string_array[$i];
        // Skip the summary sentences so that only the per-image entries are printed
        if (strpos($search_string, 'The following images are resized in HTML or CSS.') !== false) {
            continue;
        }
        if (strpos($search_string, 'Optimize the following images to reduce their size by') !== false) {
            continue;
        }
        echo '<pre>'.$search_string;
    }
}
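To check what the class test in the XPath expression above actually selects, here is a minimal sketch run against a made-up table fragment (the three rows and their class values are assumptions for illustration, not taken from the real page). It shows that the concat/normalize-space idiom matches "rules-details" as a whole class token, even alongside other classes, but not as a prefix of a longer class name.

```php
<?php
// Hypothetical fragment: only rows "a" and "c" carry the exact class token.
$html = '<table>
<tr class="rules-details"><td>a</td></tr>
<tr class="rules-details-extra"><td>b</td></tr>
<tr class="row  rules-details"><td>c</td></tr>
</table>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Pad the normalized class list with spaces, then look for " rules-details ".
$expression = './/tr[contains(concat(" ", normalize-space(@class), " "), " rules-details ")]';

$matched = [];
foreach ($xpath->query($expression) as $tr) {
    $matched[] = trim($tr->nodeValue);
}

var_export($matched); // the texts of the rows that matched
```

A plain contains(@class, "rules-details") would also have matched row "b", which is why the padded form is usually preferred.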
The question is: given the following string, how do I extract the URL and the second image dimension?
is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).
I need:
I will be optimizing this prototype script, but this is how I am implementing John Conde's answer:
// #########################################
// Holds the details of one image to resize
// #########################################
class Image
{
    public $image_url;
    public $image_name;
    public $image_path;
    public $image_full_path;
    public $original_size;
    public $new_size;
}
$debugging = true;
if ($debugging === true) { echo '<ul class="Results" style="display:block; height:auto;">'; }
$HTML = file_get_contents('');// Get Webpage
if ($HTML === false) {// Abort if the page could not be retrieved
    if ($debugging === true) {
        echo '<li class="Error_Msg" style="display:block; height:auto;">';
        echo '<span><b>## FATAL ERROR - PROGRAM ABORTED ##</b></span>';
        echo '<span><b>Message:</b> Could not retrieve the HTML document</span>';
        echo '</li>';
        echo '</ul>';
    }
    exit;
}
$DOMdoc = new DOMDocument();// Object to store an HTML document
@$DOMdoc->loadHTML($HTML);// Parse the HTML (warnings from malformed markup are suppressed)
$racks = (new DOMXPath($DOMdoc))->query('//tr/td/div//ul/li');// Run the XPath query to select every image list item
$images_info_array = array();// Array for storing image details objects
$document_root = $_SERVER['DOCUMENT_ROOT'];// Define the document root
foreach ($racks as $rack) {// Traverse over the HTML structure
    // Define a pattern to search for
    $expression = "/https?:\/\/[^\",]+ is resized in HTML or CSS from \d{1,4}x\d{1,4} to \d{1,4}x\d{1,4}\./";
    if (preg_match($expression, $rack->nodeValue) === 1) {// If the pattern is found then
        $url = $rack->firstChild->nodeValue;// Get the URL from the string
        preg_match_all('/\d{1,4}x\d{1,4}/', $rack->nodeValue, $matches);// Get the image dimensions from the string
        [$original_size, $new_size] = $matches[0];// First match is the original size, second is the new size
        $url_parts = parse_url($url);// Break the URL up into sections
        $directory_path = $url_parts['path'];// Get the directory path without the domain
        $path_parts = pathinfo($directory_path);// Get information about a file path
        $position = strpos($directory_path, '/');// Find the first / in the file path
        if ($position !== false) {// If found
            $new_directory_path = substr_replace($directory_path, "", $position, strlen('/'));// Remove the /
            $image_info = new Image();// Create a new Image Object
            $image_info->image_url = $url;// Store the image URL
            $image_info->image_name = basename($url);// Store just the image name
            $image_info->image_path = $path_parts['dirname'];// Store image directory without domain & file name
            $image_info->image_full_path = $new_directory_path;// Store the file path without the leading /
            $image_info->original_size = $original_size;// Store the original image size
            $image_info->new_size = $new_size;// Store the new image size
            array_push($images_info_array, $image_info);// Add the image information to an array
        } else {// If the / is not found
            if ($debugging === true) {
                echo '<li class="Warning_Msg">';
                echo '<span><b>## WARNING - FILE PATH CHARACTER MISSING ##</b></span>';
                echo '<span><b>Message:</b> / in the file path not found</span>';
                echo '</li>';
            }
        }
    }else{// If the pattern is not found then
        if ($debugging === true) {
            echo '<li class="Error_Msg" style="display:block; height:auto;">';
            echo '<span><b>## FATAL ERROR - PROGRAM ABORTED ##</b></span>';
            echo '<span><b>Message:</b> Could not find the pattern required to extract the URL & size information</span>';
            echo '</li>';
        }
    }
}
try {
    foreach ($images_info_array as $image_info) {// Traverse the image info array
        $source_path = $document_root.'/'.$image_info->image_full_path;// Absolute path of the original image
        if (file_exists($source_path)) {// Check if the image exists
            $temp_path = $document_root.$image_info->image_path.'/temp/';// Define a temporary folder location
            if (file_exists($temp_path)) {// If the temporary folder exists, recursively empty it
                $files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($temp_path, RecursiveDirectoryIterator::SKIP_DOTS), RecursiveIteratorIterator::CHILD_FIRST);
                foreach ($files as $fileinfo) {
                    $todo = ($fileinfo->isDir() ? 'rmdir' : 'unlink');
                    $todo($fileinfo->getRealPath());// Delete the file or the (now empty) directory
                }
            } else {// If it does not exist create it
                mkdir($temp_path, 0777);
            }
            // Define the convert command for recommended optimization of the image
            $command = 'convert -thumbnail '.escapeshellarg($image_info->new_size).' '.escapeshellarg($source_path).' '.escapeshellarg($temp_path.$image_info->image_name).' 2>&1';
            $last_line = system($command, $return_value);// Run the defined command
            if ($debugging === true) {
                if ($return_value === 0) {// An exit status of 0 means the command succeeded
                    echo '<li class="Normal_Message">';
                    echo '<span><b>Command:</b> '.$command.'</span>';
                    echo '<span><b>Directory:</b> '.$image_info->image_full_path.'</span>';
                    echo '<span><b>Resized:</b> '.$image_info->new_size.'</span>';
                    echo '<span><b>Returned:</b> '.$return_value.'</span>';
                    echo '<span><b>Output:</b> '.$last_line.'</span>';
                    echo '</li>';
                } else {// Any other exit status means the command failed
                    echo '<li class="Error_Msg" style="display:block; height:auto;">';
                    echo '<span><b>## ERROR - THE COMMAND DID NOT COMPLETE ##</b></span>';
                    echo '<span><b>Command:</b> '.$command.'</span>';
                    echo '<span><b>Returned:</b> '.$return_value.'</span>';
                    echo '<span><b>Output:</b> '.$last_line.'</span>';
                    echo '</li>';
                }
            }
        } else {// If the file does not exist
            echo '<li class="Warning_Message" style="display:block; height:auto;">The file doesn\'t exist</li>';
        }
    }
} catch (Exception $Error_Message) {
    echo $Error_Message;
}
echo '</ul>';
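For what it's worth, the URL-splitting steps in the script above (parse_url, pathinfo, then stripping the leading slash) could be gathered into one small function. This is only a sketch under assumptions: the function name describe_image_url and the example.com URL are hypothetical, and it uses ltrim, which removes every leading slash rather than just the first one.

```php
<?php
/**
 * Hypothetical helper: split an image URL into the pieces the script stores.
 * Returns null when the URL has no path component.
 */
function describe_image_url(string $url): ?array
{
    // Ask parse_url for the path only, e.g. "/images/products/photo.jpg".
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return null;
    }

    return [
        'name'      => basename($path),                    // file name only
        'directory' => pathinfo($path, PATHINFO_DIRNAME),  // directory without domain or file name
        'relative'  => ltrim($path, '/'),                  // path without any leading slash
    ];
}

// Made-up URL, used purely for illustration.
$info = describe_image_url('https://example.com/images/products/photo.jpg');
print_r($info);
```

Taking the file name from the parsed path instead of the whole URL also keeps a query string, if there is one, out of the name.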
Upvotes: 0
Views: 599
Reputation: 48041
I will offer a slightly altered approach from John's answer.
Use XPath to access the desired <a> tags and grab their values, then take each <a> tag's parent value and use preg_match() to isolate the dimension substring after the keyword "to" (\K resets the full-string match so that no capture groups are necessary).
Code: (Demo)
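$dom = new DOMDocument();
@$dom->loadHTML($html); // $html holds the markup from the question; @ suppresses the parser warning
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//tr/td/div//ul/li/a') as $a) {
    $result[] = [
        $a->nodeValue, // the link text, i.e. the image URL
        preg_match('~to \K\d+x\d+~', $a->parentNode->nodeValue, $m) ? $m[0] : ''
    ];
}
var_export($result);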
Note that I am suppressing the HTML error generated by the <p> tag that wraps the list.
Why: Should ol/ul be inside <p> or outside?
For this reason, the XPath expression jumps past the p tag straight to the ul inside of it.
Output:
array (
  0 =>
  array (
    0 => '',
    1 => '123x200',
  ),
  1 =>
  array (
    0 => '',
    1 => '135x200',
  ),
  2 =>
  array (
    0 => '',
    1 => '176x200',
  ),
  3 =>
  array (
    0 => '',
    1 => '174x200',
  ),
  4 =>
  array (
    0 => '',
    1 => '68x46',
  ),
  5 =>
  array (
    0 => '',
    1 => '68x46',
  ),
  6 =>
  array (
    0 => '',
    1 => '68x46',
  ),
  7 =>
  array (
    0 => '',
    1 => '68x46',
  ),
  8 =>
  array (
    0 => '',
    1 => '138x200',
  ),
)
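To make the \K remark concrete, here is a small side-by-side sketch. The sample sentence is copied from the question with the URL left out, and the variable names are just for illustration. It compares the \K pattern with the equivalent capture-group pattern.

```php
<?php
// One list item's text from the report (URL omitted).
$text = ' is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).';

// With \K: whatever was matched before \K is dropped from the reported match,
// so the full match ($m[0]) is only the dimension string.
preg_match('~to \K\d+x\d+~', $text, $m);

// With a capture group: the full match keeps the "to " prefix and the
// dimension string lands in group 1 instead.
preg_match('~to (\d+x\d+)~', $text, $g);

var_export([$m[0], $g[0], $g[1]]);
```

Both patterns find the same substring. The \K version simply leaves it in element 0 of the match array, so there is one element less to deal with.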
Upvotes: 0
Reputation: 219894
This will parse that HTML and output the text you are looking for:
$html = '<tr class="rules-details" style="display: none">
<td colspan="4">
<a href="/serve-scaled-images.html" class="rule-help btn hover-tooltip" data-tooltip-interactive data-tooltip-max-width="450" title="<h4>Serve scaled images</h4><p>Serving appropriately-sized images can save many bytes of data and improve the performance of your webpage, especially on low-powered (eg. mobile) devices.</p><p class="rule-help-tooltip-more"><a href="/serve-scaled-images.html">Read more</a></p>"><i class="sprite-question"></i><span class="resp-hidden">What\'s this mean?</span></a>
<p>The following images are resized in HTML or CSS. Serving scaled images could save 1.3MiB (45% reduction).
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x623 to 123x200. Serving a scaled image could save 51.3KiB (86% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x578 to 135x200. Serving a scaled image could save 44.0KiB (84% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x390 to 176x200. Serving a scaled image could save 43.2KiB (69% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x436 to 174x200. Serving a scaled image could save 35.0KiB (73% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 31.4KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.9KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
<li><a href="" target="_blank" rel="nofollow noopener noreferrer"></a> is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).</li>
';
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about the malformed markup
$items = (new DOMXPath($doc))->query('//tr/td/div//ul/li');
foreach ($items as $item) {
    $url = $item->firstChild->nodeValue; // the <a> element's text is the image URL
    preg_match_all('/\d{1,3}x\d{1,3}/', $item->nodeValue, $matches);
    [$original, $resized] = $matches[0]; // first match is the original size, second is the resized one
    printf('URL:%s Original: %s Resized: %s%s', $url, $original, $resized, PHP_EOL);
}
URL: Original: 300x623 Resized: 123x200
URL: Original: 300x578 Resized: 135x200
URL: Original: 300x390 Resized: 176x200
URL: Original: 300x436 Resized: 174x200
URL: Original: 148x100 Resized: 68x46
URL: Original: 148x100 Resized: 68x46
URL: Original: 148x100 Resized: 68x46
URL: Original: 148x100 Resized: 68x46
URL: Original: 300x458 Resized: 138x200
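One caveat with matching every NxN group in the node text: if an image file name itself contains something like 640x480, that would be picked up first. A possible variation, sketched below under assumptions, anchors the pattern on the words "from" and "to" and returns the numbers as integers. The function name parse_resize_sentence and the sample file name are made up, and the values are assumed to be in width x height order.

```php
<?php
/**
 * Hypothetical helper: pull the original and target dimensions out of one
 * report sentence, anchored on the words "from" and "to".
 * Returns null when the sentence does not have the expected shape.
 */
function parse_resize_sentence(string $sentence): ?array
{
    $pattern = '~from (\d+)x(\d+) to (\d+)x(\d+)~';
    if (preg_match($pattern, $sentence, $m) !== 1) {
        return null;
    }

    // Assumption: the report lists sizes as width x height.
    return [
        'original' => ['width' => (int) $m[1], 'height' => (int) $m[2]],
        'resized'  => ['width' => (int) $m[3], 'height' => (int) $m[4]],
    ];
}

// The file name below is invented and deliberately contains "640x480".
$sentence = 'banner-640x480.jpg is resized in HTML or CSS from 300x458 to 138x200.';
$sizes = parse_resize_sentence($sentence);

echo $sizes['resized']['width'], 'x', $sizes['resized']['height'], PHP_EOL;
```

Having width and height as separate integers can also be handy if the resize step later needs them individually.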
Upvotes: 2