Reputation: 31
I have tried most of the regular expressions I could find, but none of them work for me. I need a regular expression that strips all the HTML tags and returns the values. My HTML file contains the following elements: text inputs and selects.
<?php
$file_string = file_get_contents('page_to_scrape.html');
// (.*?) is lazy, so each match stops at the first closing tag; ?? '' avoids a notice when nothing matches.
preg_match('/<title>(.*?)<\/title>/i', $file_string, $title);
$title_out = $title[1] ?? '';
preg_match('/<option value="ELIT">(.*?)<\/option>/i', $file_string, $keywords);
$keywords_out = $keywords[1] ?? '';
preg_match('/<option value="MAS" selected="selected">(.*?)<\/option>/i', $file_string, $ash);
$ash_s = $ash[1] ?? '';
// Match value="..." wherever it appears in the <input> tag, regardless of attribute order.
preg_match('/<input\b[^>]*\bvalue="([^"]*)"[^>]*>/i', $file_string, $description);
$description_out = $description[1] ?? '';
preg_match_all('/<li><a href="(.*?)">(.*?)<\/a><\/li>/i', $file_string, $links);
?>
<p><strong>Title:</strong> <?php echo $title_out; ?></p>
<p><strong>Name:</strong> <?php echo $keywords_out; ?></p>
<p><strong>Textbox:</strong> <?php echo $description_out; ?></p>
<p><strong>Event:</strong> <?php echo $ash_s; ?></p>
<p><strong>Links:</strong> <em>(Name - Link)</em><br />
<?php
echo '<ol>';
for($i = 0; $i < count($links[1]); $i++) {
echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
}
echo '</ol>';
?>
</p>
HTML file:
<title>This is the Title</title>
</ul>
<div class="field">
<label>Event:</label>
<select name="event" class="event">
<option value="MAS" selected="selected">Same</option>
<option value="ELIT">Same4</option>
<option value="IPC">Same3</option>
<option value="VLMW">Same2</option>
</select>
</div>
<div class="field">
<label class="sub">Surname:</label>
<input name="search[name]" value="Smith" type="text">
<br>
<label class="sub">First Name:</label>
<input name="search[firstname]" value="Alex" type="text">
<br>
</div>
</div>
</body>
</html>
Upvotes: 0
Views: 2910
Reputation: 16688
You could use a DOM parser, but why not keep it simple? HTML uses tags, after all. This piece of code gets all the text using only simple string and array functions:
$html = file_get_contents('http://stackoverflow.com/questions/25680536');
$texts = []; // collected text fragments
// split the document at every tag opening
$tags = explode('<', $html);
foreach ($tags as $tag)
{
    // skip scripts (any chunk that mentions "script")
    if (strpos($tag, 'script') !== false) continue;
    // get text
    $text = strip_tags('<' . $tag);
    // only if text present remember
    if (trim($text) !== '') $texts[] = $text;
}
print_r($texts);
It ends up in an array, which is usually far more useful than just plain text. You have to do some more post-cleaning, but that's inevitable.
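For comparison, the DOM-parser route mentioned at the top of this answer could look roughly like the following. It is only a sketch using PHP's built-in DOMDocument and DOMXPath: the sample markup is a trimmed copy of the form in the question (the surrounding html/head/body wrapper is assumed, since the question's file is cut off), and the variable names $title, $event and $inputs are made up for illustration.

```php
<?php
// Hypothetical sample markup modelled on the form in the question.
$html = <<<'HTML'
<html><head><title>This is the Title</title></head><body>
<select name="event" class="event">
<option value="MAS" selected="selected">Same</option>
<option value="ELIT">Same4</option>
</select>
<input name="search[name]" value="Smith" type="text">
<input name="search[firstname]" value="Alex" type="text">
</body></html>
HTML;

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Text content of the <title> element.
$title = $xpath->evaluate('string(//title)');

// Text of the <option> that carries the selected attribute.
$event = $xpath->evaluate('string(//select[@name="event"]/option[@selected])');

// Map of name => value for every text input.
$inputs = [];
foreach ($xpath->query('//input[@type="text"]') as $input) {
    $inputs[$input->getAttribute('name')] = $input->getAttribute('value');
}

print_r(compact('title', 'event', 'inputs'));
```

Unlike the regular expressions in the question, the XPath queries do not depend on attribute order or on how the tags are spaced, so they should keep working if the markup is reformatted.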
Upvotes: 2