AwesomeSam

Reputation: 163

Get div content inside a div BeautifulSoup

I have a web page with the following structure:

<html lang="en">
<head>
    #anything
</head>
<body>
    <div id="div1">
        <div id="div2">
            <div class="class1">
                #something
            </div>
            <div class="class2">
                #something
            </div>
            <div class="class3">
                <div class="sub-class1">
                    <div id="statHolder">
                        <div class="Class 1 of 15">
                            "Name"
                            <b>Bob</b>
                        </div>
                        <div class="Class 2 of 15">
                            "Age"
                            <b>24</b>
                        </div>
                        # Here are 15 of these kinds
                    </div>
                </div>
            </div>
        </div>
    </div>
</body>
</html>

I want to retrieve the content of all 15 of those divs. How do I do that?

Edit: My Current Approach:

import requests
from bs4 import BeautifulSoup

url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
name_box = soup.find_all('div', {"id": "div1"})  # I don't know what to do after this

Expected Output:

Name: Bob
Age: 24
#All 15 entries like this

I am using BeautifulSoup4 for this. Is there any direct way to get all the contents of <div id="statHolder">?

Upvotes: 1

Views: 3305

Answers (2)

QHarr

Reputation: 84465

If you work from the actual HTML of the web page, the following gives you the stats as a dictionary. It uses the text of each element with class pSt as the key and the text of the strong tag inside that element as the associated value.

from bs4 import BeautifulSoup as bs
# html is response.content, assuming the page is not rendered dynamically
soup = bs(html, 'html.parser')
stats = {i.text:i.strong.text for i in soup.select('.pSt')}

For the HTML shown in the question, you can use stripped_strings to take the first text node of each div as the key:

from bs4 import BeautifulSoup as bs

html = '''
<html lang="en">
<head>
    #anything
</head>
<body>
    <div id="div1">
        <div id="div2">
            <div class="class1">
                #something
            </div>
            <div class="class2">
                #something
            </div>
            <div class="class3">
                <div class="sub-class1">
                    <div id="statHolder">
                        <div class="Class 1 of 15">
                            "Name"
                            <b>Bob</b>
                        </div>
                        <div class="Class 2 of 15">
                            "Age"
                            <b>24</b>
                        </div>
                        # Here are 15 of these kinds
                    </div>
                </div>
            </div>
        </div>
    </div>
</body>
</html>
'''
soup = bs(html, 'html.parser')
stats = {next(i.stripped_strings): i.b.text for i in soup.select('#statHolder [class^=Class]')}
print(stats)
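
Note that with the sample HTML the keys keep their literal double quotes (for example '"Name"'). The following is a minimal sketch, assuming the quotes really are part of the text as in the question, that strips them and prints the `Name: Bob` format asked for. The shortened `html` string and the `format_stats` helper are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's sample markup
# (assumption: the double quotes are literal text in the page).
html = '''
<div id="statHolder">
    <div class="Class 1 of 15">
        "Name"
        <b>Bob</b>
    </div>
    <div class="Class 2 of 15">
        "Age"
        <b>24</b>
    </div>
</div>
'''

def format_stats(markup):
    """Return a {label: value} dict with the quotes removed from each label."""
    soup = BeautifulSoup(markup, "html.parser")
    stats = {}
    for div in soup.select('#statHolder [class^=Class]'):
        # The first non-empty string in the div is the quoted label.
        label = next(div.stripped_strings)
        stats[label.strip('"')] = div.b.get_text(strip=True)
    return stats

stats = format_stats(html)
for key, value in stats.items():
    print(f"{key}: {value}")
```

If the real page does not wrap the labels in quotes, the `strip('"')` call is simply a no-op.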

Upvotes: 2

costaparas

Reputation: 5237

Based on the HTML above, you can try it this way:

import requests
from bs4 import BeautifulSoup

result = {}
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
stats = soup.find('div', {'id': 'statHolder'})
for data in stats.find_all('div'):
    key, value = data.text.split()
    result[key.replace('"', '')] = value

print(result)
# Prints:
# {'Name': 'Bob', 'Age': '24'}

for key, value in result.items():
    print(f'{key}: {value}')
# Prints: 
# Name: Bob
# Age: 24

This finds the div with the id of statHolder.

Then, we find all divs inside that div and split each one's text on whitespace (using split), which yields two tokens: the first is the key, and the second is the value. We also remove the double quotes from the key using replace.

Then, we add the key-value pair to our result dictionary.

Iterating over the dictionary's items then gives the desired output, as shown.
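
One caveat with the approach above: unpacking `data.text.split()` into two names raises a ValueError as soon as a value contains whitespace, and `soup.find` returns None if the page has no statHolder div. Here is a hedged sketch of a more defensive variant. The markup with "Bob Smith" and the `parse_stats` helper are hypothetical, and it assumes each stat is a direct child div of statHolder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same shape as the question, but one value contains a space.
html = '''
<div id="statHolder">
    <div class="Class 1 of 15">"Name" <b>Bob Smith</b></div>
    <div class="Class 2 of 15">"Age" <b>24</b></div>
</div>
'''

def parse_stats(markup):
    """Collect label/value pairs from the direct child divs of #statHolder."""
    soup = BeautifulSoup(markup, "html.parser")
    holder = soup.find("div", id="statHolder")
    if holder is None:  # find() returns None when nothing matches
        return {}
    result = {}
    # recursive=False limits the search to direct children of statHolder.
    for row in holder.find_all("div", recursive=False):
        strings = list(row.stripped_strings)
        if len(strings) < 2:
            continue  # skip rows that lack either a label or a value
        key = strings[0].strip('"')
        # Join the remaining strings so multi-word values survive intact.
        result[key] = " ".join(strings[1:])
    return result

result = parse_stats(html)
for key, value in result.items():
    print(f"{key}: {value}")
```

Using stripped_strings instead of split keeps each text node whole, so the label and the value are separated by tag boundaries rather than by spaces.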

Upvotes: 2
