Reputation: 5374
Here's what I'm doing: I'm scraping some HTML from an external site using Simple HTML DOM, then I strip the spaces out and use a regex to grab the information I need and put it into an array. This code was working perfectly until the external site modified their HTML and I had to come up with a new regex. I wrote one that seemed to capture everything I wanted (I tested it on regexr.com), but for some reason it doesn't work once I plug it into my code. Here's the PHP:
<?php
header("Content-Type: text/plain");
require('../classes/simple_html_dom.php');
$html = file_get_html('http://www.***.com/');
$player_array = array();
foreach($html->find('table#herodev_list td') as $ele){
$ele = $ele->innertext;
$html_string = $html_string.$ele;
}
$html_string = str_replace(" ", "", $html_string);
$regex = '/(?<=/avatar/).+?(?=/)/';
preg_match_all($regex, $html_string, $matches);
foreach($matches[0] as $player){
array_push($player_array, strtolower($player));
}
print_r($player_array);
The problem seems to lie in the preg_match_all - the matches array is empty so I'm assuming nothing was matched. Here is a sample of what $html_string usually looks like:
<imgsrc="http://minotar.net/avatar/Kainzo/10.png"><imgsrc="http://minotar.net/avatar/PuffinMuffin19/10.png"><imgsrc="http://minotar.net/avatar/neows0/10.png"><imgsrc="http://minotar.net/avatar/Sniped105/10.png"><imgsrc="http://minotar.net/avatar/EJBomber26/10.png"><imgsrc="http://minotar.net/avatar/GiantBeardedFace/10.png"><imgsrc="http://minotar.net/avatar/Montelu/10.png"><imgsrc="http://minotar.net/avatar/GreekCrackShot/10.png"><imgsrc="http://minotar.net/avatar/Marcellinius/10.png"><imgsrc="http://minotar.net/avatar/HelsEch/10.png"><imgsrc="http://minotar.net/avatar/NZD2000/10.png"><imgsrc="http://minotar.net/avatar/Mrchucklez/10.png"><imgsrc="http://minotar.net/avatar/Dragondrakar/10.png"><imgsrc="http://minotar.net/avatar/malita55/10.png"><imgsrc="http://minotar.net/avatar/Dazzlar/10.png">
My best guess is that PHP's regex engine differs somehow from Regexr or I'm just doing something stupid. It's been months since I originally wrote this app so its inner workings are not fresh in my mind. Any help is appreciated.
Also, please don't give me the old, "Don't use Regular Expressions to parse HTML..." speech. I know.
By the way, this is my old regex that worked properly (the input was different though of course):
(?<=^|>)[^><]+?(?=<|$)
Upvotes: 1
Views: 136
Reputation: 915
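You have to escape the /'s inside the regex. Because / is also the delimiter, PHP treats the first unescaped / after the opening one as the end of the pattern and tries to read everything after it as modifiers, so preg_match_all fails instead of matching.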
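As a minimal sketch of that fix, here is the pattern from the question with each literal / written as \/. The input is the first two entries of the sample string from the question; the variable name $players is only for illustration.

```php
<?php
// Shortened sample input, copied from the question.
$html_string = '<imgsrc="http://minotar.net/avatar/Kainzo/10.png">'
             . '<imgsrc="http://minotar.net/avatar/PuffinMuffin19/10.png">';

// Same pattern as in the question, but every literal / is escaped,
// so only the first and last / act as delimiters.
$regex = '/(?<=\/avatar\/).+?(?=\/)/';

preg_match_all($regex, $html_string, $matches);

// Lowercase each captured name, as the original loop does.
$players = array_map('strtolower', $matches[0]);
print_r($players);
```

With this input, $players should contain kainzo and puffinmuffin19.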
-EDIT-
ascii-lime also pointed out that you can change the delimiter to another character, as long as it is not alphanumeric, a backslash, or whitespace. To do this, replace the / at the start and end of the expression with the character of your choice; the /'s inside the pattern then no longer need escaping. Example:
'/.+\/regex.com\/index.html+./'
to
'!.+/regex.com/index.html+.!'
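Applied to the pattern from the question, the same idea might look like the sketch below. The choice of ~ as the delimiter is arbitrary, the input is a shortened copy of the sample string from the question, and the check on the return value is an addition that was not in the original code.

```php
<?php
// Shortened sample input, copied from the question.
$html_string = '<imgsrc="http://minotar.net/avatar/neows0/10.png">'
             . '<imgsrc="http://minotar.net/avatar/Sniped105/10.png">';

// ~ is the delimiter here, so the / characters are plain literals.
$regex = '~(?<=/avatar/).+?(?=/)~';

// preg_match_all returns false when the pattern itself is invalid,
// which is worth checking separately from "zero matches".
$count = preg_match_all($regex, $html_string, $matches);
if ($count === false) {
    exit("The regex could not be compiled or executed.\n");
}

$player_array = array_map('strtolower', $matches[0]);
print_r($player_array);
```

Checking for false separates a broken pattern from a pattern that simply found nothing, which is the distinction that was hard to see in the original code.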
Upvotes: 4