Edu V Magadan

Reputation: 69

BeautifulSoup parsing of html list

I'm new to parsing. I have a simple HTML list without class attributes, like:

    <h2><a href="..">Title 1</a></h2>
    <ol>
        <li>Line 1..</li>
        <li>Line 2...</li>
        ...
    </ol>
    <h2><a href="..">Title 2</a></h2>
    <ol>
        <li>Line 2-1..</li>
        <li>Line 2-2...</li>
        ...
    </ol>
...

and so on..

I run this code:

import requests
from bs4 import BeautifulSoup as BS

r = requests.get('http://...')
html = BS(r.content, 'html.parser')

H2 = html.find_all('h2')
for h2 in H2:
    title = h2.text
    print(title)

to get the titles, but how can I get the <ol> list that belongs to each title in the same loop?

Upvotes: 1

Views: 207

Answers (2)

Andrej Kesely

Reputation: 195408

Another solution: You can use .find_previous:

from bs4 import BeautifulSoup


txt = '''
    <h2><a href="..">Title 1</a></h2>
    <ol>
        <li>Line 1</li>
        <li>Line 2</li>
        ...
    </ol>
    <h2><a href="..">Title 2</a></h2>
    <ol>
        <li>Line 2-1</li>
        <li>Line 2-2</li>
        ...
    </ol>
'''

soup = BeautifulSoup(txt, 'html.parser')

out = {}
for li in soup.select('ol li'):
    out.setdefault(li.find_previous('h2').text, []).append(li.text)

print(out)

Prints:

{'Title 1': ['Line 1', 'Line 2'], 
 'Title 2': ['Line 2-1', 'Line 2-2']}
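If you would rather keep the loop over <h2> from the question, the same grouping can probably be built in the other direction with .find_next_sibling. This is only a sketch: it assumes each <h2> and its <ol> are siblings under the same parent, as in the sample markup, and the variable names are illustrative. Note that if a title has no list of its own, .find_next_sibling('ol') would pick up the list of a later title.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question (the real page may differ)
txt = '''
<h2><a href="..">Title 1</a></h2>
<ol>
    <li>Line 1</li>
    <li>Line 2</li>
</ol>
<h2><a href="..">Title 2</a></h2>
<ol>
    <li>Line 2-1</li>
    <li>Line 2-2</li>
</ol>
'''

soup = BeautifulSoup(txt, 'html.parser')

out = {}
for h2 in soup.find_all('h2'):
    # Assumption: the list for a title is the next <ol> sibling of its <h2>
    ol = h2.find_next_sibling('ol')
    # Fall back to an empty list when no <ol> follows the title
    items = [li.get_text(strip=True) for li in ol.find_all('li')] if ol is not None else []
    out[h2.get_text(strip=True)] = items

print(out)
```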

Upvotes: 1

jizhihaoSAMA

Reputation: 12672

An easy way is to use zip. Try:

import requests
from bs4 import BeautifulSoup as BS


source = '''
<h2><a href="..">Title 1</a></h2>
    <ol>
        <li>Line 1..</li>
        <li>Line 2...</li>
    </ol>
    <h2><a href="..">Title 2</a></h2>
    <ol>
        <li>Line 2-1..</li>
        <li>Line 2-2...</li>
    </ol>
'''


html = BS(source, 'html.parser')
for title, element in zip(html.find_all('h2'), html.find_all('ol')):
    print(title.text, element.text)

Result:

Title 1 
Line 1..
Line 2...

Title 2 
Line 2-1..
Line 2-2...

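Note: zip stops at the shorter sequence, so if the number of <h2> and <ol> elements differs, the extra elements are silently dropped. In that case you could use itertools.zip_longest instead of zip, which pads the missing entries with None.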
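A minimal sketch of the zip_longest variant, assuming a page where the last title has no list. The sample markup and the names pairs, title_text and items are made up for illustration. Keep in mind that the pairing is still purely positional, so a missing list in the middle of the page would shift every later pair.

```python
from itertools import zip_longest
from bs4 import BeautifulSoup

# Hypothetical markup: the second title has no <ol>
source = '''
<h2><a href="..">Title 1</a></h2>
<ol>
    <li>Line 1..</li>
</ol>
<h2><a href="..">Title 2</a></h2>
'''

html = BeautifulSoup(source, 'html.parser')

pairs = []
for title, element in zip_longest(html.find_all('h2'), html.find_all('ol')):
    # zip_longest fills the shorter sequence with None, so check before use
    title_text = title.get_text(strip=True) if title is not None else None
    items = [li.get_text(strip=True) for li in element.find_all('li')] if element is not None else []
    pairs.append((title_text, items))

print(pairs)
```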

Upvotes: 1
