Help in this content extraction + beautiful soup

Question

I am trying to extract data from a site whose markup is in this format:

 
 
 
..... extra stuff
  **Main Content**

Note that the Main Content can contain other tags, but I want the entire content as a string.

So what I did was this:

_divTag = data.find("div", id="storytext")
innerdiv = _divTag.find("div")  # find the first div tag
innerdiv.contents[0].replaceWith("")  # replace with an empty string

I expected that _divTag would then hold only the main content, but this does not work. Can anybody tell me what mistake I am making and how I should extract the main content?

smci · Accepted Answer

Just do _divTag.contents[2].

Your formatting may have misled you: the Main Content text does not belong to the innermost div tag (as innerdiv.text, innerdiv.contents or innerdiv.findChildren() will show you). It is a direct child of the outer div, sitting after the inner div.

It makes things clearer if you indent your original XML:

 
   
     
      ..... extra stuff
      **Main Content**

(PS: I'm not clear what the intent of your innerdiv.contents[0].replaceWith("") was. To squelch the attributes? Newlines? Anyway, the BS philosophy is not to edit the parse tree, but simply to ignore the 99.9% that you don't care about. See the Beautiful Soup documentation.)
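If the main content itself contains tags, a single contents[2] lookup returns only the first node. One possible way to follow the "ignore, don't edit" approach is to join everything that comes after the inner div. This is a sketch under assumptions, not the answerer's code: the HTML below is made up, and the names outer, inner, main_html and main_text are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (assumption): the main content mixes text and a tag.
html = (
    '<div id="storytext">'
    '<div>..... extra stuff</div>'
    'Main <b>Content</b> with tags'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="storytext")
inner = outer.find("div")

# Collect every node after the inner div, markup included.
# The parse tree itself is left untouched.
main_html = "".join(str(node) for node in inner.next_siblings)

# Re-parse the fragment to get a plain-text version without tags.
main_text = BeautifulSoup(main_html, "html.parser").get_text()

print(main_html)
print(main_text)
```

Because nothing is replaced or removed, the same soup object can still be queried for the inner div or anything else afterwards.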
