Extracting contents of nested <p> Tags with Beautiful Soup

Question

I am having a hard time applying Beautiful Soup to my use case. There are many similar, but not always identical, nested p tags whose contents I want to extract. Examples follow:

20normal string
21this text belongs together
22some text (a reference text)that might continue
23more text

24text with (first)two references second.

I need to save the string of the span tag as well as the text inside the p tag, regardless of its styling, and, where present, the referencequote text. So from the examples above I would like to extract:

example = 20, text = 'normal string', reference = []
example = 21, text = 'this text belongs together', reference = []
example = 22, text = 'some text that might continue', reference = ['a reference text']
example = 23, text = 'more text', reference = []
example = 24, text = 'text with two references', reference = ['first', 'second']

What I tried was to collect all items with the "example" class and then loop through each parent's contents.

from bs4 import NavigableString

verses = []
for span in bs.find_all("span", {"class": "example"}):
    number, text, references = None, "", []
    for item in span.parent.contents:
        if isinstance(item, NavigableString):
            text += item  # append, so earlier strings are not overwritten
        elif "example" in item.get("class", []):
            number = int(item.string)
        elif "referencequote" in item.get("class", []):
            references.append(item.string)
        else:
            pass  # how to handle other tags?
    verses.append(MyClassObject(n=number, t=text, r=references))

My approach is very error-prone, and there might be even more inline tags that I am ignoring right now. Unfortunately, the get_text() method returns something like '22 some text a reference text that might continue', with the number and the reference merged into the text.
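
To make the get_text() problem concrete, here is a minimal sketch. The markup is an assumption reconstructed from the description above (a span with class "example" for the number and a span with class "referencequote" for the reference); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for example 22; the exact tag structure is assumed.
SNIPPET = (
    '<p><span class="example">22</span>some text '
    '<span class="referencequote">(a reference text)</span>'
    'that might continue</p>'
)

p = BeautifulSoup(SNIPPET, "html.parser").p

# get_text() concatenates every descendant string, so the number and the
# reference end up in the same flat string as the paragraph text.
flat = p.get_text()
spaced = p.get_text(" ", strip=True)

print(flat)    # 22some text (a reference text)that might continue
print(spaced)  # 22 some text (a reference text) that might continue
```

Either way the result is a single string, so the three pieces have to be separated before the text is flattened.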

There must be an elegant way to extract this information. Could you give me some ideas for other approaches? Thanks in advance!

dabingsou · Accepted Answer

Try this. It uses the simplified_scrapy library rather than Beautiful Soup.

from simplified_scrapy.core.regex_helper import replaceReg
from simplified_scrapy import SimplifiedDoc,utils
html = '''
20normal string
21this text belongs together
22some text (a reference text)that might continue
23more text

24text with (first)two references second.
'''
html = replaceReg(html,"<[/]*strong>","") # Preprocessing: strip <strong> and </strong> tags
doc = SimplifiedDoc(html)
ps = doc.ps
for p in ps:
    text = ''.join(p.spans.nextText())
    text = replaceReg(text,"[()]+","") # Remove ()
    span = p.span # Get first span
    spans = span.getNexts(tag="span").text # Get references
    print (span["class"], span.text, text, spans)

Result:

example 20 normal string []
example 21 this text belongs together []
example 22 some text that might continue ['a reference text']
example 23 more text []
example 24 text with two references. ['first', 'second']
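
The preprocessing step in the code above removes the strong tags with a regular expression before parsing. For readers who want to stay with Beautiful Soup, a comparable effect can probably be had with Tag.unwrap(), which replaces a tag with its own children. The sketch below shows both routes; the sample markup is hypothetical, since the question does not show which word is bold.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the position of <strong> is assumed for illustration.
SNIPPET = '<p><span class="example">21</span>this text <strong>belongs</strong> together</p>'

# Option 1: strip the tags from the raw string with the standard library.
stripped = re.sub(r"</?strong>", "", SNIPPET)

# Option 2: parse first, then unwrap each <strong> so its text stays in place.
soup = BeautifulSoup(SNIPPET, "html.parser")
for tag in soup.find_all("strong"):
    tag.unwrap()

print(stripped)
print(str(soup.p))
```

After unwrapping, the formerly bold text is a direct child of the p tag, so a loop over the paragraph's contents sees it as an ordinary string.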

Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
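
For comparison, here is a minimal Beautiful Soup sketch of the same extraction. It is not taken from the answer above. The HTML is an assumption reconstructed from the question (class names "example" and "referencequote", references wrapped in parentheses), and extract_examples is a hypothetical helper name. The idea is to read the number and the references first, remove those spans from the paragraph, and only then call get_text() on what is left.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag structure is assumed from the question's
# description and may not match the real page.
HTML = """
<p><span class="example">20</span>normal string</p>
<p><span class="example">21</span>this text <strong>belongs</strong> together</p>
<p><span class="example">22</span>some text <span class="referencequote">(a reference text)</span>that might continue</p>
<p><span class="example">23</span>more text</p>
<p><span class="example">24</span>text with <span class="referencequote">(first)</span>two references<span class="referencequote">(second)</span>.</p>
"""


def extract_examples(html):
    """Return a list of (number, text, references) tuples, one per <p>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for p in soup.find_all("p"):
        number_span = p.find("span", class_="example")
        if number_span is None:
            continue  # not one of the numbered paragraphs
        ref_spans = p.find_all("span", class_="referencequote")

        number = int(number_span.get_text(strip=True))
        references = [s.get_text(strip=True).strip("()") for s in ref_spans]

        # Remove the spans already read, so only the running text is left.
        # Note that this modifies the parsed tree.
        for span in [number_span, *ref_spans]:
            span.extract()

        # get_text() descends into any remaining inline tags such as <strong>;
        # split/join collapses the whitespace left behind by removed spans.
        text = " ".join(p.get_text().split())
        results.append((number, text, references))
    return results


for number, text, references in extract_examples(HTML):
    print(number, repr(text), references)
```

Because get_text() collects every descendant string, styling tags that were not anticipated need no special handling. Only the spans that carry the number and the references are treated separately.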
