user506710

Help in this content extraction + beautiful soup

I am trying to extract data from a site whose markup is in this format:

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
<div id=storytext class=storytext> 
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
..... extra stuff
</div>  **Main Content**
</div>
</div>

Note that the main content can contain other tags, but I want the entire content as a single string.

So this is what I did:

_divTag = data.find( "div" , id = "storytext" )
innerdiv = _divTag.find( "div" ) # find the first div tag
innerdiv.contents[0].replaceWith("") # replace with null

I expected _divTag to then hold only the main content, but this does not work. Can anybody tell me what mistake I am making and how I should extract the main content?

Upvotes: 0

Views: 593

Answers (1)

smci

Reputation: 33950

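Just use _divTag.contents[2]. For the markup as posted, index 0 is the whitespace after the opening tag, index 1 is the inner div, and index 2 is the text you want.

Your formatting may have been misleading you: this text does not belong to the innermost div tag (as innerdiv.text, innerdiv.contents or innerdiv.findChildren() will show you). It is a sibling that follows the inner div inside storytext.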

It makes things clearer if you indent your original HTML:

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
  <div id=storytext class=storytext> 
    <div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
      ..... extra stuff
    </div>  **Main Content**
  </div>
</div>
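
To check that structure directly, here is a minimal sketch that parses the markup from the question and lists the children of the storytext div. It assumes the newer bs4 package with the built-in html.parser (the question's replaceWith and findChildren calls suggest the older BeautifulSoup 3 API), and the names div_tag and inner_div are just illustrative stand-ins for _divTag and innerdiv.

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
html = """<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div>  **Main Content**
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div_tag = soup.find("div", id="storytext")
inner_div = div_tag.find("div")  # the a2a_kit div

# List the direct children: a whitespace string, the inner div,
# then the string holding the main content.
for index, node in enumerate(div_tag.contents):
    print(index, type(node).__name__, repr(str(node)[:30]))

# The main content is a sibling of the inner div, not part of it.
main_content = div_tag.contents[2].strip()
print(main_content)
```

Note that the index depends on the whitespace in the markup: if there were no newline between the storytext div and the inner div, the text would sit at a different index.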

(PS: I'm not clear what the intent of your innerdiv.contents[0].replaceWith("") was. To squelch the attributes? Newlines? Anyway, the BS philosophy is not to edit the parse tree, but simply to ignore the 99.9% that you don't care about. See the BS documentation.)
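
Since the question says the main content may itself contain tags, a single contents[2] lookup would only return the first text node in that case. One way to follow the "ignore what you don't need" approach is to join everything that comes after the inner div. This is only a sketch under assumptions: it uses bs4 with html.parser, the sample markup is a made-up variant of the question's HTML, and content_after_inner_div is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's markup, with tags in the main content.
html = """<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style'>..... extra stuff</div>
Main <b>content</b> with <a href="#">nested</a> tags
</div>"""


def content_after_inner_div(markup):
    """Return everything after the first inner div of #storytext as one string."""
    soup = BeautifulSoup(markup, "html.parser")
    div_tag = soup.find("div", id="storytext")
    inner_div = div_tag.find("div")
    # next_siblings yields the strings and tags that follow the inner div;
    # str() keeps nested markup intact, and the parse tree is left unmodified.
    return "".join(str(node) for node in inner_div.next_siblings).strip()


main_content = content_after_inner_div(html)
print(main_content)
```

If only the text is wanted, without the nested markup, the tags among the siblings could be converted with get_text() instead of str().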

Upvotes: 2
