Haifeng Zhang

Reputation: 31903

How to parse nested tags with Beautiful Soup?

       <div class="t m0 x1c h4 y10f ff2 fs2 fc0 sc0 ls0 ws0">
            Kne e
            <span class="_ _72">
            </span>
            <span class="ff3">
                102.2°
                <span class="_ _8">
                </span>
                97.5°
                <span class="_ _4e">
                </span>
                99.8°
            </span>
        </div>
        <div class="t m0 xd h4 y110 ff2 fs2 fc0 sc0 ls0 ws0">
                A n k l e
                <span class="_ _7d">
                </span>
                <span class="ff3">
                    46.0°
                    <span class="_ _17">
                    </span>
                    46.3°
                    <span class="_ _4e">
                    </span>
                    33.5°
                </span>
        </div>

I have a large HTML file like the one shown above. It contains nested divs (my example shows only two levels of nesting). The class attribute names are generated randomly, so I can't target a specific div by its class.

I am using Beautiful Soup 4 to pull data from the HTML into a plain text file, which works fine, but I want to format the output nicely. As my example shows, each div is one row with 4 columns. I want the output to be Knee 102.2° 97.5° 99.8°, followed by a row with the Ankle columns.

Below I print the class attribute of every div. I observed that the first one is the parent and the rest are its children. How can I format the text of each child div, one by one? The parent and children shown in the example are just part of my HTML; they are nested inside other divs as well. Thanks!

['t', 'm0', 'xd', 'h4', 'y118', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']  --> parent div
['t', 'm0', 'x37', 'h3', 'y119', 'ff2', 'fs1', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x39', 'h4', 'y11a', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x52', 'h4', 'y11b', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x11', 'h4', 'y11c', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x1c', 'h4', 'y11d', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x1c', 'h4', 'y11e', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x54', 'h4', 'y11f', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x11', 'h4', 'y120', 'ff2', 'fs2', 'fc4', 'sc0', 'ls0', 'ws0']   --> this is the knee div
['t', 'm0', 'x1c', 'h4', 'y121', 'ff2', 'fs2', 'fc5', 'sc0', 'ls0', 'ws0']   --> this is ankle div

Upvotes: 0

Views: 1363

Answers (1)

PRMoureu

Reputation: 13327

In this case, since the class names are not usable, you can use CSS selectors to match a pattern of tags instead.

If the parent tag is a <div>, you can use soup.select('div > div') to get the child <div> nodes and extract their text.

You may need to add more tags to this selector, depending on the HTML.

A working example:


from bs4 import BeautifulSoup as soup

html = """
<div>
    <div class="t m0 x1c h4 y10f ff2 fs2 fc0 sc0 ls0 ws0">
            Kne e
            <span class="_ _72">
            </span>
            <span class="ff3">
                102.2°
                <span class="_ _8">
                </span>
                97.5°
                <span class="_ _4e">
                </span>
                99.8°
            </span>
        </div>
        <div class="t m0 xd h4 y110 ff2 fs2 fc0 sc0 ls0 ws0">
                A n k l e
                <span class="_ _7d">
                </span>
                <span class="ff3">
                    46.0°
                    <span class="_ _17">
                    </span>
                    46.3°
                    <span class="_ _4e">
                    </span>
                    33.5°
                </span>
        </div>
    </div>
 """

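# Parse the markup; `soup` is the BeautifulSoup class alias imported above,
# so the parsed document gets its own name instead of rebinding the alias
doc = soup(html, 'lxml')
# Select every <div> that is a direct child of another <div>
result = doc.select('div > div')

for res in result:
    # Drop all spaces (this joins "Kne e" into "Knee"), then turn newlines into spaces
    print(res.get_text().replace(' ','').replace('\n',' '))

# Output (leading and trailing spaces omitted):
# Knee    102.2°   97.5°   99.8°
# Ankle    46.0°   46.3°   33.5°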
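
If the rows are wrapped in several levels of <div>, note that 'div > div' also matches the intermediate wrappers, whose text contains every row at once. One possible variation, as a rough sketch rather than a definitive solution: keep only the divs that contain no other div, and build each row from the tag's stripped_strings. The helper name table_rows is made up for this illustration, the built-in 'html.parser' is used here instead of 'lxml', and the code assumes each value sits in its own text node (separated by the empty spans), as in the sample.

```python
from bs4 import BeautifulSoup

# Sample markup shaped like the question's, with one extra wrapper level
html = """
<div>
  <div>
    <div class="t m0 x1c h4 y10f ff2 fs2 fc0 sc0 ls0 ws0">
        Kne e
        <span class="_ _72">
        </span>
        <span class="ff3">
            102.2°
            <span class="_ _8">
            </span>
            97.5°
            <span class="_ _4e">
            </span>
            99.8°
        </span>
    </div>
    <div class="t m0 xd h4 y110 ff2 fs2 fc0 sc0 ls0 ws0">
        A n k l e
        <span class="_ _7d">
        </span>
        <span class="ff3">
            46.0°
            <span class="_ _17">
            </span>
            46.3°
            <span class="_ _4e">
            </span>
            33.5°
        </span>
    </div>
  </div>
</div>
"""


def table_rows(markup):
    """Return one 'label value value ...' string per innermost <div>.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    doc = BeautifulSoup(markup, "html.parser")
    rows = []
    for div in doc.find_all("div"):
        # Skip wrapper divs: only divs without a nested <div> hold a row
        if div.find("div") is not None:
            continue
        # stripped_strings yields each non-blank text node, stripped;
        # removing inner spaces turns "Kne e" into "Knee"
        cells = [text.replace(" ", "") for text in div.stripped_strings]
        if cells:
            rows.append(" ".join(cells))
    return rows


rows = table_rows(html)
for row in rows:
    print(row)

# Knee 102.2° 97.5° 99.8°
# Ankle 46.0° 46.3° 33.5°
```

Removing the inner spaces only works because the labels are letter-spaced single words; a label made of several words would need a different rule.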

Upvotes: 1
