I am trying to extract data from a site which is in this format:
<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div> **Main Content**
</div>
</div>
Note that the main content can contain other tags, but I want the entire content as a string.
So what I did was this:
_divTag = data.find( "div" , id = "storytext" )
innerdiv = _divTag.find( "div" ) # find the first div tag
innerdiv.contents[0].replaceWith("") # replace with null
I expected _divTag to then contain only the main content, but this does not work. Can anybody tell me what mistake I am making and how I should extract the main content?
Upvotes: 0
Views: 593
Reputation: 33950
Just do _divTag.contents[2].
Your formatting was maybe misleading you - this text does not belong to the innermost div tag (as innerdiv.text, innerdiv.contents or innerdiv.findChildren() will show you).
It makes things clearer if you indent your original HTML:
<div id=storytextp class=storytextp align=center style='padding:10px;'>
    <div id=storytext class=storytext>
        <div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
            ..... extra stuff
        </div> **Main Content**
    </div>
</div>
(PS: I'm not clear what the intent of your innerdiv.contents[0].replaceWith("") was. To squelch the attributes? Newlines? Anyway, the BeautifulSoup (BS) philosophy is not to edit the parse tree, but simply to ignore the 99.9% that you don't care about. BS Documentation is here).
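Since the question mentions that the main content may itself contain tags, contents[2] alone might return only the first text fragment. A minimal sketch of one way to collect everything after the widget div as a single string, without modifying the parse tree, could look like this. It assumes the bs4 package with the built-in "html.parser", and the sample markup (including the <b> tag standing in for "other tags") is made up for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the <b> tag is a hypothetical
# stand-in for "other tags" inside the main content.
html = """<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div> <b>Main</b> Content
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="storytext")

# The first nested div is the widget we want to ignore.
widget = story.find("div")

# Join every direct child of the story div except the widget,
# keeping nested tags as markup, then trim surrounding whitespace.
main_content = "".join(
    str(child) for child in story.contents if child is not widget
).strip()

print(main_content)
```

If only the plain text is wanted rather than the markup, calling get_text() on the tag children instead of str() should work along the same lines.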
Upvotes: 2