Reputation: 163
I have a website in the following format:
<html lang="en">
<head>
#anything
</head>
<body>
<div id="div1">
<div id="div2">
<div class="class1">
#something
</div>
<div class="class2">
#something
</div>
<div class="class3">
<div class="sub-class1">
<div id="statHolder">
<div class="Class 1 of 15">
"Name"
<b>Bob</b>
</div>
<div class="Class 2 of 15">
"Age"
<b>24</b>
</div>
# Here are 15 of these kinds
</div>
</div>
</div>
</div>
</div>
</body>
</html>
I want to retrieve all the content in those 15 classes. How do I do that?
Edit: My Current Approach:
import requests
from bs4 import BeautifulSoup
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
name_box = soup.findAll('div', {"id": "div1"})  # I don't know what to do after this
Expected Output:
Name: Bob
Age: 24
#All 15 entries like this
I am using BeautifulSoup4 for this.
Is there any direct way to get all the contents in <div id="stats">?
Upvotes: 1
Views: 3305
Reputation: 84465
If you work from the actual HTML of the webpage, the following gives you the stats as a dictionary. It takes the text of each element with class pSt as the key and then reads the strong tag inside that element to get the associated value.
from bs4 import BeautifulSoup as bs
# html is response.content, assuming the page is not rendered dynamically
soup = bs(html, 'html.parser')
stats = {i.text:i.strong.text for i in soup.select('.pSt')}
For the HTML you have shown, you can use stripped_strings to get the first string in each div (the label) and the b tag to get the value:
from bs4 import BeautifulSoup as bs
html = '''
<html lang="en">
<head>
#anything
</head>
<body>
<div id="div1">
<div id="div2">
<div class="class1">
#something
</div>
<div class="class2">
#something
</div>
<div class="class3">
<div class="sub-class1">
<div id="statHolder">
<div class="Class 1 of 15">
"Name"
<b>Bob</b>
</div>
<div class="Class 2 of 15">
"Age"
<b>24</b>
</div>
# Here are 15 of these kinds
</div>
</div>
</div>
</div>
</div>
</body>
</html>
'''
soup = bs(html, 'html.parser')
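# Key: first non-empty string in the div, with the surrounding double quotes removed; value: text of the <b> tag
stats = {next(i.stripped_strings).strip('"'): i.b.text for i in soup.select('#statHolder [class^=Class]')}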
print(stats)
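If the class names on the real page do not share a common prefix, a child selector may be simpler than matching on the class attribute. The following is only a minimal sketch, assuming every stat is a div that is a direct child of #statHolder and contains one label string followed by a b tag. The markup is a trimmed copy of the HTML in the question, and the lines variable is introduced here purely for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question
html = '''
<div id="statHolder">
  <div class="Class 1 of 15">"Name" <b>Bob</b></div>
  <div class="Class 2 of 15">"Age" <b>24</b></div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

stats = {}
# "#statHolder > div" selects only the divs that are direct children of the holder
for div in soup.select("#statHolder > div"):
    # First non-empty string is the label; drop the surrounding double quotes
    label = next(div.stripped_strings).strip('"')
    stats[label] = div.b.get_text(strip=True)

# Format the dictionary the way the question asks for
lines = [f"{key}: {value}" for key, value in stats.items()]
print("\n".join(lines))
```

On the sample markup this gives one "label: value" line per stat, in document order.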
Upvotes: 2
Reputation: 5237
Based on the HTML above, you can try it this way:
import requests
from bs4 import BeautifulSoup
result = {}
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
stats = soup.find('div', {'id': 'statHolder'})
for data in stats.find_all('div'):
    # Each div's text splits into two pieces: the quoted label and the value
    key, value = data.text.split()
    result[key.replace('"', '')] = value
print(result)
# Prints:
# {'Name': 'Bob', 'Age': '24'}
for key, value in result.items():
    print(f'{key}: {value}')
# Prints:
# Name: Bob
# Age: 24
This finds the div with the id of statHolder. Then, we find all divs inside that div and split the text of each one on whitespace (using split), with the first piece being the key and the second being the value. We also remove the double quotes from the key using replace. Then, we add the key-value pair to our result dictionary.
Iterating through this, you can get the desired output as shown.
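One caveat: unpacking the result of split() into exactly two names raises a ValueError if a value contains whitespace. Here is a rough sketch of a variant that tolerates that, assuming the same structure as the HTML in the question. The parse_stats helper, its holder_id parameter, and the "Bob Smith" sample value are made up for illustration and are not from the original page.

```python
from bs4 import BeautifulSoup


def parse_stats(html, holder_id="statHolder"):
    """Return a {label: value} dict for the divs directly inside the holder div."""
    soup = BeautifulSoup(html, "html.parser")
    holder = soup.find("div", id=holder_id)
    if holder is None:
        return {}
    result = {}
    # recursive=False restricts the search to direct children of the holder
    for div in holder.find_all("div", recursive=False):
        strings = list(div.stripped_strings)
        if len(strings) < 2:
            continue  # skip divs without both a label and a value
        key, *rest = strings
        # Remove the double quotes from the key; rejoin any remaining strings as the value
        result[key.strip('"')] = " ".join(rest)
    return result


# Hypothetical sample with a value that contains a space
sample = '''
<div id="statHolder">
<div class="Class 1 of 15">"Name" <b>Bob Smith</b></div>
<div class="Class 2 of 15">"Age" <b>24</b></div>
</div>
'''
result = parse_stats(sample)
for key, value in result.items():
    print(f"{key}: {value}")
```

Because stripped_strings yields one string per text node rather than per word, the value inside the b tag stays intact even when it contains spaces.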
Upvotes: 2