Reputation: 31903
<div class="t m0 x1c h4 y10f ff2 fs2 fc0 sc0 ls0 ws0">
Kne e
<span class="_ _72">
</span>
<span class="ff3">
102.2°
<span class="_ _8">
</span>
97.5°
<span class="_ _4e">
</span>
99.8°
</span>
</div>
<div class="t m0 xd h4 y110 ff2 fs2 fc0 sc0 ls0 ws0">
A n k l e
<span class="_ _7d">
</span>
<span class="ff3">
46.0°
<span class="_ _17">
</span>
46.3°
<span class="_ _4e">
</span>
33.5°
</span>
</div>
I have a large HTML file like the one shown above. It contains nested divs (I only cut two levels of nested divs for my example). The class attribute names are generated randomly, so I cannot target a specific div by its class.
I am using Beautiful Soup 4 to pull data from the HTML into a plain text file, which works fine, but I want to format the output nicely. In my example, each div is one row with 4 columns, so I want the first row to be: Knee 102.2° 97.5° 99.8°
and the next row to hold the Ankle columns.
Below I printed the class attribute of every div and observed that the first one is the parent and the rest are children. How can I format the text of the child divs one by one? The parent and children shown in the example are only part of my HTML; they are nested inside other divs as well. Thanks!
['t', 'm0', 'xd', 'h4', 'y118', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0'] --> parent div
['t', 'm0', 'x37', 'h3', 'y119', 'ff2', 'fs1', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x39', 'h4', 'y11a', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x52', 'h4', 'y11b', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x11', 'h4', 'y11c', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x1c', 'h4', 'y11d', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x1c', 'h4', 'y11e', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x54', 'h4', 'y11f', 'ff2', 'fs2', 'fc0', 'sc0', 'ls0', 'ws0']
['t', 'm0', 'x11', 'h4', 'y120', 'ff2', 'fs2', 'fc4', 'sc0', 'ls0', 'ws0'] --> this is the knee div
['t', 'm0', 'x1c', 'h4', 'y121', 'ff2', 'fs2', 'fc5', 'sc0', 'ls0', 'ws0'] --> this is ankle div
Upvotes: 0
Views: 1363
Reputation: 13327
In this case, without usable class names, you can use CSS selectors to match a pattern of tags.
If the parent tag is a <div>, you can use soup.select('div > div') to get the child <div> nodes and extract their text.
You may need to add more tags to this selector, depending on the HTML code.
A working example:
from bs4 import BeautifulSoup as soup
html = """
<div>
<div class="t m0 x1c h4 y10f ff2 fs2 fc0 sc0 ls0 ws0">
Kne e
<span class="_ _72">
</span>
<span class="ff3">
102.2°
<span class="_ _8">
</span>
97.5°
<span class="_ _4e">
</span>
99.8°
</span>
</div>
<div class="t m0 xd h4 y110 ff2 fs2 fc0 sc0 ls0 ws0">
A n k l e
<span class="_ _7d">
</span>
<span class="ff3">
46.0°
<span class="_ _17">
</span>
46.3°
<span class="_ _4e">
</span>
33.5°
</span>
</div>
</div>
"""
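doc = soup(html, 'lxml')  # `soup` is the BeautifulSoup alias imported above
# 'div > div' matches every <div> that is a direct child of another <div>
result = doc.select('div > div')
for res in result:
    # drop the stray spaces inside labels ("Kne e"), then collapse the remaining whitespace
    print(' '.join(res.get_text().replace(' ', '').split()))
# >>> Knee 102.2° 97.5° 99.8°
#     Ankle 46.0° 46.3° 33.5°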
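Since the question mentions that the rows sit inside further wrapper divs, 'div > div' may also match those wrappers. One possible variation, sketched below, is to keep only the innermost divs (those with no <div> descendant) and to build each row from the div's non-empty text pieces. The function name leaf_div_rows, the extra wrapper <div> in the sample, and the use of the built-in 'html.parser' instead of 'lxml' are assumptions made for this illustration; it also assumes that only the first text piece (the label) contains stray spaces.

```python
from bs4 import BeautifulSoup

html = """
<div>
<div>
<div class="t m0 x1c h4 y10f ff2 fs2 fc0 sc0 ls0 ws0">
Kne e
<span class="_ _72"> </span>
<span class="ff3">102.2°<span class="_ _8"> </span>97.5°<span class="_ _4e"> </span>99.8°</span>
</div>
<div class="t m0 xd h4 y110 ff2 fs2 fc0 sc0 ls0 ws0">
A n k l e
<span class="_ _7d"> </span>
<span class="ff3">46.0°<span class="_ _17"> </span>46.3°<span class="_ _4e"> </span>33.5°</span>
</div>
</div>
</div>
"""


def leaf_div_rows(markup):
    """Return one formatted text row per innermost <div> (hypothetical helper)."""
    doc = BeautifulSoup(markup, "html.parser")
    rows = []
    for div in doc.find_all("div"):
        # Skip wrapper divs: anything that still contains another <div>.
        if div.find("div") is not None:
            continue
        # stripped_strings yields each text node stripped, skipping blank ones.
        cells = list(div.stripped_strings)
        if not cells:
            continue
        # Assumption: only the label (first cell) has stray spaces, e.g. "Kne e".
        label = cells[0].replace(" ", "")
        rows.append(" ".join([label] + cells[1:]))
    return rows


print("\n".join(leaf_div_rows(html)))
# Knee 102.2° 97.5° 99.8°
# Ankle 46.0° 46.3° 33.5°
```

The returned list can be written to the plain text file one row per line. If some innermost divs in the real file are not data rows, an extra check (for example on the number of cells) would be needed.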
Upvotes: 1