prashant

Reputation: 3608

Counting words with embedded HTML in PHP

I have some fairly large paragraphs (5000-6000 words) containing text and embedded HTML tags. I want to break each paragraph into chunks of 1500 words, ignoring the HTML markup: the 1500 should count only actual words, not anything inside the tags. Using strip_tags I can count the number of words (ignoring the markup), but I can't figure out how to break the text into chunks of 1500 words while keeping the markup in the output. For example, with a chunk size of 5 words:

This is <b> a </b> paragraph which <a href="#"> has some </a> some text to be broken in <h1> 5 words </h1>.

The result should be

1 = This is <b> a </b> paragraph which
2 = <a href="#"> has some </a> some text to
3 = be broken in <h1> 5 words </h1>. 

Upvotes: 4

Views: 238

Answers (3)

dualed

Reputation: 10502

Use an XML DOM parser or an HTML DOM parser.

  • Iterate over all nodes
  • Count the words in each node
  • If the running word count exceeds N
    • create a new node of the parent's type
    • insert it as a sibling after the parent
    • move the current node and all subsequent siblings into it
  • Move to the next element
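
The first two steps (iterating over nodes and counting words per node) might look roughly like the sketch below, which uses PHP's DOMDocument and DOMXPath. The function name wordCountsPerTextNode is hypothetical, the wrapping div is only there to give the fragment a single root, and a "word" is assumed to be any whitespace-separated token. It does not perform the node splitting itself, and it assumes the input is ASCII or otherwise handled correctly by loadHTML.

```php
<?php
// Hypothetical helper: list the word count of every non-empty text node,
// in document order, together with the name of its parent element.
function wordCountsPerTextNode(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    // Assumption: wrap the fragment in a div so it has a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//text()') as $node) {
        // Assumption: a word is any run of non-whitespace characters.
        $words = preg_split('/\s+/', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) > 0) {
            $result[] = [
                'parent' => $node->parentNode->nodeName,
                'words'  => count($words),
            ];
        }
    }
    return $result;
}
```

Summing the 'words' values while walking the result tells you in which text node the N-th word falls, which is where the split described above would have to happen.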

Upvotes: 0

glenatron

Reputation: 11352

I think you're going to need to parse your HTML if you want to guarantee valid markup. In that case, this question should provide a really useful starting point.

Upvotes: 1

Serge Kuharev

Reputation: 1052

Think about using the explode() function wisely. Or, better but longer, use a regular expression that matches either a word or a tag together with all the text inside it. You should treat the contents of an HTML element as an unbreakable entity. For example, you can write a function that breaks your large paragraph into the following array of entities:

$data = array(
  array( "count" => 2, "text" => "This is "),
  array( "count" => 1, "text" => "<b> a </b>"),
  array( "count" => 2, "text" => " paragraph which"),
  // ... and so on for the rest of the paragraph
);

Then write a loop that builds the smaller paragraphs from the $data array, starting a new one whenever the running count reaches the limit.

Also, sometimes it won't be possible to make a paragraph exactly 1500 words long. It can be more or less, because you should not split an HTML element across two paragraphs.
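
A minimal sketch of this idea follows. Instead of building the intermediate $data array it tokenizes with preg_split and chunks in a single pass, so it is a variation on the approach rather than a literal implementation. The function name chunkHtmlByWords is hypothetical. It assumes reasonably well-formed markup (properly nested tags, no ">" inside attribute values), and it counts as a word any whitespace-separated token that contains at least one letter or digit.

```php
<?php
// Hypothetical helper: split $html into chunks of about $limit words.
// Only text outside tags is counted, and a chunk never ends inside an element.
function chunkHtmlByWords(string $html, int $limit): array
{
    // Void elements have no closing tag, so they must not change the nesting depth.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];

    // Split into tags and text runs, keeping the tags as tokens.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $chunks = [];
    $current = '';
    $count = 0;   // words in the current chunk
    $depth = 0;   // number of currently open elements

    $flush = function () use (&$chunks, &$current, &$count) {
        if (trim($current) !== '') {
            $chunks[] = trim($current);
        }
        $current = '';
        $count = 0;
    };

    foreach ($tokens as $token) {
        if ($token[0] === '<' && substr($token, -1) === '>') {
            preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/', $token, $m);
            $isClosing = ($m[1] ?? '') === '/';
            $name = strtolower($m[2] ?? '');
            $isVoid = $name === '' || in_array($name, $void, true)
                || substr($token, -2) === '/>';
            // A full chunk ends before a new top-level element starts.
            if (!$isClosing && $depth === 0 && $count >= $limit) {
                $flush();
            }
            $current .= $token;
            if ($isClosing) {
                $depth = max(0, $depth - 1);
            } elseif (!$isVoid) {
                $depth++;
            }
            continue;
        }
        // Text run: walk it word by word, keeping the whitespace.
        $pieces = preg_split('/(\s+)/', $token, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        foreach ($pieces as $piece) {
            if (preg_match('/[\p{L}\p{N}]/u', $piece) === 1) {
                // A full chunk ends before the next top-level word.
                if ($depth === 0 && $count >= $limit) {
                    $flush();
                }
                $count++;
            }
            $current .= $piece;
        }
    }
    $flush();
    return $chunks;
}
```

Called with the paragraph from the question and a limit of 5, this should produce the three chunks the question asks for. When an element contains more words than the limit allows, the chunk simply runs over, which is the "more or less" case mentioned above.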

Upvotes: 2
