Reputation: 14666
How to extract all text from an HTML file
I want to extract all the text, including the text in alt attributes, <p> tags, etc.
However, I don't want to extract the text between style and script tags.
Thanks
Right now I have the following code:
<?php
// Strip tags, lowercase, and replace every run of non-letters with a single space.
$string = trim(clean(strtolower(strip_tags($html_content))));
$arr = explode(" ", $string);
// Count how often each word occurs.
$count = array_count_values($arr);
foreach ($count as $value => $freq) {
    echo trim($value) . "---" . $freq . "<br>";
}

function clean($in) {
    return preg_replace("/[^a-z]+/i", " ", $in);
}
?>
This works well, but the output still includes the contents of script and style elements, which I don't want. I'm also not sure whether it picks up attribute values such as alt, since strip_tags probably removes every HTML tag together with its attributes.
Thanks
Upvotes: 0
Views: 3863
Reputation: 41
I posted this as an answer to another post, but here it is again:
We've just launched a new natural language processing API over at repustate.com. It is a REST API (so plain curl will be fine) that cleans any HTML or PDF and returns just the text parts. Our API is free, so feel free to use it to your heart's content. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.
Upvotes: 0
Reputation: 10892
Any kind of parsing is not an option as long as you can't be sure the source is 100% well-formed XML (which HTML4, by definition, is not).
A simple preg_replace should suffice. Something like
preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);
should be enough to replace all the script and style elements and their contents with an empty string (i.e. strip them).
If you want to avoid XSS attacks, however, you're probably better off using a HTML sanitiser to normalise the HTML and then strip all the bad code.
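As a minimal sketch of the regex approach above (the helper name `strip_script_and_style` is made up for illustration, and this is not a sanitiser), the pattern needs a lazy `.*?` and the `s` flag so that each block ends at its own closing tag and can span several lines:

```php
<?php
// Hypothetical helper; the name is not part of the original answer.
function strip_script_and_style(string $html): string
{
    // \b[^>]*  allows attributes on the opening tag
    // .*?      is lazy, so each block stops at its own closing tag
    // s flag   lets . match newlines; i flag ignores case
    $result = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$html = "<p>Hello</p><script type=\"text/javascript\">\nvar x = 1;\n</script>"
      . "<p>World</p><STYLE>p { color: red; }</STYLE>";

echo strip_script_and_style($html);
```

Without the lazy quantifier, a document containing two script blocks would lose everything between the first opening tag and the last closing tag.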
Upvotes: 0
Reputation: 10880
First you can search for the <script> and <style> blocks and remove them from the HTML.
I have this function that I use a lot:
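function search($start, $end, $string, $borders = true) {
    // Pass the delimiter to preg_quote() so a "!" in $start or $end is escaped too.
    $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
    preg_match_all($reg, $string, $matches);
    // $matches[0] includes the start/end markers; $matches[1] is only the text between them.
    if ($borders) return $matches[0];
    else return $matches[1];
}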
The function returns the matching blocks as an array.
$array = search("<script>", "</script>", $html);
Remove each returned block from the HTML (for example with str_replace). Once the script and style blocks are gone, use strip_tags to get the text.
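To show how the pieces might fit together, here is a rough sketch that uses the helper to find the blocks, removes them with `str_replace`, and then calls `strip_tags`. Searching from `<script` rather than `<script>` is an assumption added here so that opening tags with attributes are matched as well; the sample `$html` is made up.

```php
<?php
// Same helper as in the answer, repeated so the example is self-contained.
function search($start, $end, $string, $borders = true)
{
    $reg = '!' . preg_quote($start, '!') . '(.*?)' . preg_quote($end, '!') . '!is';
    preg_match_all($reg, $string, $matches);
    return $borders ? $matches[0] : $matches[1];
}

$html = "<p>Hello</p>\n"
      . "<script type=\"text/javascript\">var x = 1;</script>\n"
      . "<style>p { color: red; }</style>\n"
      . "<p>World</p>";

// Collect every script and style block, including the surrounding tags.
$blocks = array_merge(
    search('<script', '</script>', $html),
    search('<style', '</style>', $html)
);

// Remove the blocks, then strip whatever tags are left.
$withoutBlocks = str_replace($blocks, '', $html);
$text = strip_tags($withoutBlocks);

echo $text;
```

Because `search()` returns the exact matched substrings, `str_replace` can remove them verbatim even when the tags are written in upper case.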
Upvotes: 0
Reputation: 83622
I personally think you should switch to an XML reader of some sort (SimpleXML, Document Object Model or XMLReader) to parse the HTML document. I'd go for a mix of DOM, SimpleXML and XPath to extract what you need - everything else will fail miserably when parsing arbitrary documents:
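libxml_use_internal_errors(true); // keep warnings about malformed markup out of the output
$dom = new DOMDocument();
$dom->loadHTML($html_content); // use DOMDocument because it can load HTML
$xml = simplexml_import_dom($dom); // switch to SimpleXML because it's easier to use.
// All <p> elements anywhere inside <body>
$pTags = $xml->xpath('/html/body//p');
// All elements inside <body> that carry an alt attribute
$tagsWithAltAttribute = $xml->xpath('/html/body//*[@alt]');
// ...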
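One possible way to finish this idea is sketched below. It swaps SimpleXML for `DOMXPath` (a deliberate change from the snippet above) so that a single query can return both text nodes and alt attributes while skipping anything inside script or style elements. The function name `extract_visible_text` and the sample markup are invented for the example.

```php
<?php
// Hypothetical helper: returns the visible text pieces and alt values in document order.
function extract_visible_text(string $html): array
{
    // Collect libxml warnings internally instead of printing them for malformed markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Text nodes that are not inside <script> or <style>, plus every alt attribute.
    $nodes = $xpath->query(
        '//text()[not(ancestor::script) and not(ancestor::style)] | //@alt'
    );

    $pieces = [];
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $pieces[] = $value;
        }
    }
    return $pieces;
}

$html = '<html><head><style>p { color: red; }</style></head><body>'
      . '<p>Hello world</p><img src="a.png" alt="A red apple">'
      . '<script>var x = 1;</script></body></html>';

print_r(extract_visible_text($html));
```

The returned array can be joined with spaces and fed into the word-counting code from the question.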
Upvotes: 7
Reputation: 20663
First remove the script and style elements together with their contents, then use your current way of cleaning tags and you'll get the text.
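One way to combine that advice with the word count from the question is sketched here. The function name `word_frequencies` is hypothetical, and the alt handling is an added assumption: it only picks up double-quoted alt values, which are collected before `strip_tags` discards them.

```php
<?php
// Hypothetical helper: word => frequency for the visible text plus alt values.
function word_frequencies(string $html): array
{
    // 1. Drop script and style elements together with their contents.
    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', ' ', $html) ?? $html;

    // 2. strip_tags() discards attributes, so collect double-quoted alt values first.
    preg_match_all('/\salt\s*=\s*"([^"]*)"/i', $html, $m);
    $altText = implode(' ', $m[1]);

    // 3. Strip the remaining tags; a space before each tag keeps adjacent words apart.
    $text = strip_tags(str_replace('<', ' <', $html)) . ' ' . $altText;

    // 4. Lowercase, split on anything that is not a letter, and count.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

$html = '<p>Red apple</p><img src="a.png" alt="Red fruit"><script>var apple = 1;</script>';

foreach (word_frequencies($html) as $word => $freq) {
    echo $word . '---' . $freq . "\n";
}
```

Splitting with `PREG_SPLIT_NO_EMPTY` avoids the empty-string entries that `explode(" ", ...)` can produce when the input is empty or contains repeated separators.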
Upvotes: 0