ahmed

Reputation: 14666

How to extract all text from an HTML file using PHP?

I want to extract all text from an HTML file, including the text in alt attributes, <p> tags, etc.

However, I don't want to extract the text between style and script tags.

Right now I have the following code:

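    <?php
    // Replace every run of non-letter characters with a single space.
    function clean($in) {
        return preg_replace("/[^a-z]+/i", " ", $in);
    }

    // Strip the tags, lowercase the text and split it into words.
    $string = trim(clean(strtolower(strip_tags($html_content))));
    $arr = explode(" ", $string);

    // Count how often each word occurs and print the result.
    $count = array_count_values($arr);
    foreach ($count as $value => $freq) {
        echo trim($value) . "---" . $freq . "<br>";
    }
    ?>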

This works well, but it also returns the contents of script and style tags, which I don't want. The other problem is that I am not sure whether it retrieves attributes like alt, since the strip_tags function might remove all HTML tags together with their attributes.

Thanks

Upvotes: 0

Views: 3863

Answers (5)

Martin

Reputation: 41

I posted this as an answer to another post, but here it is again:

We've just launched a new natural language processing API over at repustate.com. It is a REST API (so plain curl will do) that lets you clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. Check it out and compare the results to readability.js; I think you'll find they're almost 100% the same.

Upvotes: 0

Alan Plum

Reputation: 10892

Any kind of parsing is not an option as long as you can't be sure the source is 100% well-formed XML (which HTML4, by definition, is not).

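A simple preg_replace should suffice. Something like

    preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);

should be enough to replace all the script and style elements and their contents with an empty string (i.e. strip them). The s modifier lets the dot match newlines, so blocks that span several lines are caught, and the lazy .*? stops at the first closing tag rather than the last one.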

If you want to avoid XSS attacks, however, you're probably better off using an HTML sanitiser to normalise the HTML and then strip all the bad code.

Upvotes: 0

Sabeen Malik

Reputation: 10880

First you can search for the <script> and <style> blocks and remove them from the HTML.

I have this function that I use a lot:

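    function search($start, $end, $string, $borders = true) {
        // Pass the delimiter to preg_quote() so that a "!" in $start or $end is escaped too.
        $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
        preg_match_all($reg, $string, $matches);

        // $matches[0] includes the start and end markers, $matches[1] only the text between them.
        if ($borders) return $matches[0];
        else return $matches[1];
    }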

The function returns the matching blocks as an array:

    $array = search("<script>", "</script>", $html);

Once the script and style blocks are gone, use strip_tags to get the text.
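A minimal sketch of how the removal step might look, assuming the search() function from this answer. The helper name textWithoutBlocks and the sample markup are made up for illustration. The start markers are written without the closing ">" so that tags with attributes (for example a type attribute) are matched as well, which is an assumption on my part and not part of the original call.

```php
<?php
// search() as in the answer, with the regex delimiter passed to preg_quote().
function search($start, $end, $string, $borders = true)
{
    $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
    preg_match_all($reg, $string, $matches);
    return $borders ? $matches[0] : $matches[1];
}

// Hypothetical helper: cut out every block found by search(), then strip the remaining tags.
function textWithoutBlocks($html)
{
    $markers = [["<script", "</script>"], ["<style", "</style>"]];
    foreach ($markers as [$start, $end]) {
        // search() returns the full blocks (borders included); replace each with nothing.
        $html = str_replace(search($start, $end, $html), "", $html);
    }
    return trim(strip_tags($html));
}

// Made-up sample input.
$html = '<p>Hello <b>world</b></p>'
      . '<script type="text/javascript">alert(1);</script>'
      . '<style>p { color: red; }</style>';

echo textWithoutBlocks($html);
```

Note that strip_tags also discards attributes, so alt text would still need to be collected separately.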

Upvotes: 0

Stefan Gehrig

Reputation: 83622

I personally think you should switch to an XML reader of some sort (SimpleXML, DOM or XMLReader) to parse the HTML document. I'd go for a mix of DOM, SimpleXML and XPath to extract what you need; everything else will fail miserably when parsing arbitrary documents:

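    libxml_use_internal_errors(true); // keep warnings about malformed HTML out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html_content); // use DOMDocument because it can load HTML
    $xml = simplexml_import_dom($dom); // switch to SimpleXML because it's easier to use
    $pTags = $xml->xpath('/html/body//p'); // all <p> elements in the body
    $tagsWithAltAttribute = $xml->xpath('/html/body//*[@alt]'); // all elements that have an alt attribute
    // ...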
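To cover the rest of the question (all text, but nothing from script or style), one possible variation is to stay with DOM and query the text nodes directly with DOMXPath. This is only a sketch: the function name extractText and the sample markup are invented for illustration, and it assumes the DOM extension is available.

```php
<?php
// Hypothetical helper: returns the visible text nodes and alt attribute values as a list of strings.
function extractText(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $texts = [];

    // Text nodes anywhere in the body, except those inside <script> or <style>.
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $texts[] = $value;
        }
    }

    // Values of alt attributes, e.g. on <img> elements.
    foreach ($xpath->query('//body//@alt') as $attr) {
        $value = trim($attr->nodeValue);
        if ($value !== '') {
            $texts[] = $value;
        }
    }

    return $texts;
}

// Made-up sample input.
$html = '<html><head><style>p { color: red; }</style></head><body>'
      . '<p>Hello <b>world</b></p><img src="a.png" alt="A picture">'
      . '<script>var x = 1;</script></body></html>';

print_r(extractText($html));
```

The returned strings can then be joined and fed into the word counting code from the question.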

Upvotes: 7

Andrey Adamovich

Reputation: 20663

First remove script and style tags with full content, then use your current way of cleaning tags and you'll get the text.
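A rough sketch of those two steps combined with the word counting from the question. The function name wordFrequencies and the sample input are hypothetical, and the regular expression is just one way to remove the script and style blocks. Putting a space before every "<" is my own addition so that words from adjacent tags do not run together after strip_tags.

```php
<?php
// Hypothetical helper: counts how often each word occurs in the visible text of an HTML string.
function wordFrequencies(string $html): array
{
    // Step 1: remove <script> and <style> elements together with their contents.
    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', ' ', $html) ?? $html;

    // Step 2: strip the remaining tags; the extra space keeps neighbouring words apart.
    $text = strtolower(strip_tags(str_replace('<', ' <', $html)));

    // Step 3: split on anything that is not a letter and count the words.
    $words = preg_split('/[^a-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Made-up sample input.
$html = "<p>Red fish</p><script>\nvar fish = 1;\n</script>"
      . "<style>p { color: red; }</style><p>Blue fish</p>";

foreach (wordFrequencies($html) as $word => $freq) {
    echo $word . "---" . $freq . "<br>";
}
```

Like the original code, this only sees element text; alt attributes are dropped by strip_tags and would have to be extracted before that step.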

Upvotes: 0
