versvs

Reputation: 643

Split an HTML document into pieces by parsing HTML comments with BeautifulSoup

This is a pretty small question that has been almost resolved in a previous question.

The problem is that right now I have an array of comments, but it is not quite what I need: I get an array of the comments' content, and I need the HTML in between them.

Say I have something like:

<p>some html here<p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
<!-- begin mark -->
<p>This will be pulled later, but we will come to it when I get to pull the previous section.</p>
<!-- end mark -->

In a reply, they pointed to the Crummy explanation of navigating the HTML tree, but I didn't find an answer to my problem there.

Any ideas? Thanks.

PS. Extra kudos if someone can point me to an elegant way to repeat the process several times in a document; I can probably get it to work, but poorly :D

Edited to add:

With the information provided by Martijn Pieters, I was able to pass each comment in the comments array (obtained using the above code) to the generator function he designed. This runs without error:

for elem in comments:
    # Pass the current comment, not the whole array, and
    # materialize the generator so the nodes can be printed.
    htmlcode = list(allnext(elem))
    print(htmlcode)

I think it will now be possible to manipulate the htmlcode content while iterating through the array.

Upvotes: 1

Views: 1507

Answers (1)

Martijn Pieters

Reputation: 1121584

You can use the .next_sibling pointer to get to the next element. You can use that to find everything following a comment, up to but not including another comment:

from bs4 import Comment

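def allnext(comment):
    # Start with the node right after the given comment.
    curr = comment.next_sibling
    # Stop at the next comment, or at the end of the parent (None).
    while curr is not None and not isinstance(curr, Comment):
        yield curr
        curr = curr.next_sibling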

This is a generator function; iterate over it to visit each of the 'next' elements:

for elem in allnext(comment):
    print(elem)

Or you can use it to create a list of all the next elements:

elems = list(allnext(comment))
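
To repeat this for every marked section in a document, one option is to collect all the comments and run the generator on each comment that opens a section. What follows is a minimal Python 3 sketch under stated assumptions, not a definitive implementation: it assumes the markers read exactly `begin mark` and `end mark` as in the question, that the sample's paragraphs are all properly closed, and that the built-in `html.parser` backend is used. The `HTML` sample and the `sections` name are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Assumed sample input, modeled on the question (all <p> tags closed).
HTML = """<p>some html here</p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
<!-- begin mark -->
<p>This will be pulled later.</p>
<!-- end mark -->
"""


def allnext(comment):
    """Yield the siblings after `comment`, stopping at the next comment."""
    for node in comment.next_siblings:
        if isinstance(node, Comment):
            return
        yield node


soup = BeautifulSoup(HTML, "html.parser")

# Every comment in the document, in document order.
comments = [node for node in soup.descendants if isinstance(node, Comment)]

# One list of nodes per "begin mark" comment; "end mark" comments are skipped.
sections = [
    list(allnext(comment))
    for comment in comments
    if comment.strip() == "begin mark"
]

for number, section in enumerate(sections, start=1):
    # Each section also holds newline strings, so keep only the tags here.
    tags = [node for node in section if isinstance(node, Tag)]
    print(number, [tag.get_text() for tag in tags])
```

Because each section stops at the first comment it meets, this only works when the markers are siblings of the content they enclose; nested markers would need a different traversal.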

Your example is a little too small for BeautifulSoup, which will wrap each comment in a <p> tag, but a snippet from your original target, www.gamespot.com, works just fine:

<div class="ad_wrap ad_wrap_dart"><div style="text-align:center;"><img alt="Advertisement" src="http://ads.com.com/Ads/common/advertisement.gif" style="display:block;height:10px;width:120px;margin:0 auto;"/></div>
<!-- start of gamespot gpt ad tag -->
<div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
        googletag.display('div-gpt-ad-1359295192-lb-top');
    </script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&amp;sz=728x90|970x66|970x150|970x250|960x150&amp;t=pos%3Dtop%26platform%3Ddesktop%26&amp;c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&amp;sz=728x90|970x66|970x150|970x250|960x150&amp;t=pos%3Dtop%26platform%3Ddesktop%26&amp;c=1359295192"/>
</a>
</noscript>
</div>
<!-- end of gamespot gpt tag -->
</div>

If comment is a reference to the first comment in that snippet, the allnext() generator gives me:

>>> list(allnext(comment))
[u'\n', <div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
        googletag.display('div-gpt-ad-1359295192-lb-top');
    </script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&amp;sz=728x90|970x66|970x150|970x250|960x150&amp;t=pos%3Dtop%26platform%3Ddesktop%26&amp;c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&amp;sz=728x90|970x66|970x150|970x250|960x150&amp;t=pos%3Dtop%26platform%3Ddesktop%26&amp;c=1359295192"/>
</a>
</noscript>
</div>, u'\n']
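
Note that the result includes the newline strings around the tag. If you want the HTML in between as a single string, one possible approach is to drop the whitespace-only strings and join the rest. This is only a sketch in Python 3: the shortened `SNIPPET`, the `ad-slot` id, and the `html_between` name are hypothetical stand-ins, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, shortened stand-in for the snippet above.
SNIPPET = """<div class="ad_wrap">
<!-- start of ad tag -->
<div id="ad-slot"><span>ad</span></div>
<!-- end of ad tag -->
</div>"""

soup = BeautifulSoup(SNIPPET, "html.parser")

# The first comment in the document.
comment = next(n for n in soup.descendants if isinstance(n, Comment))

# Collect the siblings that follow it, up to the next comment.
nodes = []
for node in comment.next_siblings:
    if isinstance(node, Comment):
        break
    nodes.append(node)

# Drop whitespace-only strings, then serialize what is left.
kept = [node for node in nodes if str(node).strip()]
html_between = "".join(str(node) for node in kept)
print(html_between)
```

Calling `str()` on a tag serializes it together with its children, so the joined string can be passed to another parser or written to a file.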

Upvotes: 2
