This is a fairly small question that was almost resolved in a previous question.
The problem is that right now I have an array of comments, but it is not quite what I need: I get an array of the comments' content, and I need to get the HTML in between.
Say I have something like:
<p>some html here</p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
<!-- begin mark -->
<p>This will be pulled later, but we will come to it when I get to pull the previous section.</p>
<!-- end mark -->
In a reply, they pointed to the Crummy explanation on navigating the HTML tree, but I didn't find an answer to my problem there.
Any ideas? Thanks.
PS. Extra kudos if someone can point me to an elegant way to repeat the process a few times in a document, as I can probably get it to work, but poorly :D
Edited to add:
Using the information provided by Martijn Pieters, I passed each comment from the array obtained with the above code to the generator function he designed. This gives no error:
for comment in comments:
    htmlcode = list(allnext(comment))
    print(htmlcode)
I think it will now be possible to manipulate the htmlcode content on each pass through the array.
Upvotes: 1
Views: 1507
Reputation: 1121584
You can use the .next_sibling pointer to get to the next element. You can use that to find everything following a comment, up to but not including another comment:
from bs4 import Comment

def allnext(comment):
    curr = comment
    while True:
        curr = curr.next_sibling
        # Stop at the next comment, or when there are no more siblings.
        if curr is None or isinstance(curr, Comment):
            return
        yield curr
This is a generator function; use it to iterate over all 'next' elements:
for elem in allnext(comment):
    print(elem)
or you can use it to create a list of all next elements:
elems = list(allnext(comment))
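To show where the comment reference might come from, here is a minimal end-to-end sketch in Python 3. It assumes the built-in "html.parser" backend and wraps the sample from the question in a <div>; the HTML string and the variable names (first, paragraphs) are only illustrative, and allnext() is repeated so the block runs on its own.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Sample markup adapted from the question (the <div> wrapper is an assumption).
HTML = """<div>
<p>some html here</p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
</div>"""


def allnext(comment):
    """Yield the siblings after `comment`, stopping at the next comment."""
    curr = comment.next_sibling
    while curr is not None and not isinstance(curr, Comment):
        yield curr
        curr = curr.next_sibling


soup = BeautifulSoup(HTML, "html.parser")

# Comments are text nodes, so select them by filtering strings on their type.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
first = comments[0]

# The generator also yields whitespace strings; keep only real tags here.
elems = list(allnext(first))
paragraphs = [el.get_text() for el in elems if isinstance(el, Tag)]
print(paragraphs)
```

Filtering on Tag drops the newline strings that sit between the elements; leave that filter out if the whitespace matters.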
Your example is a little too small for BeautifulSoup, and it'll wrap each comment in a <p> tag, but if we use a snippet from your original target, www.gamespot.com, this works just fine:
<div class="ad_wrap ad_wrap_dart"><div style="text-align:center;"><img alt="Advertisement" src="http://ads.com.com/Ads/common/advertisement.gif" style="display:block;height:10px;width:120px;margin:0 auto;"/></div>
<!-- start of gamespot gpt ad tag -->
<div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>
<!-- end of gamespot gpt tag -->
</div>
If comment is a reference to the first comment in that snippet, the allnext() generator gives me:
>>> list(allnext(comment))
[u'\n', <div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>, u'\n']
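For repeating the process several times in one document, one possible approach is to loop over every comment and run the generator only from those that open a section. The sketch below is Python 3 and assumes the "begin mark" / "end mark" comment texts from the question and the "html.parser" backend; marked_sections() and its start_text parameter are hypothetical names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Sample markup from the question, with two marked sections.
HTML = """<p>some html here</p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
<!-- begin mark -->
<p>This will be pulled later.</p>
<!-- end mark -->
"""


def allnext(comment):
    """Yield the siblings after `comment`, stopping at the next comment."""
    curr = comment.next_sibling
    while curr is not None and not isinstance(curr, Comment):
        yield curr
        curr = curr.next_sibling


def marked_sections(soup, start_text="begin mark"):
    """Return one list of tags per comment whose text equals start_text."""
    sections = []
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        # Skip the closing markers and any unrelated comments.
        if comment.strip() != start_text:
            continue
        # Keep only real tags, dropping the whitespace strings in between.
        sections.append([el for el in allnext(comment) if isinstance(el, Tag)])
    return sections


soup = BeautifulSoup(HTML, "html.parser")
sections = marked_sections(soup)
for number, section in enumerate(sections, start=1):
    print(number, "".join(str(tag) for tag in section))
```

Because allnext() stops at any comment, the "end mark" comment closes each section without being matched explicitly.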
Upvotes: 2