Reputation: 21
I'm trying to scrape the following kind of HTML with BeautifulSoup.
<div …. > <div…..>
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div> <div class="class2">12</div>
</div></div>
Not every person has a second item to scrape, so something like soup.find_all("div", attrs={"class": "class2"}) will not work correctly: it returns both 50 and 12, but the 12 is no longer connected to the right person.
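To make the mismatch concrete, here is a minimal sketch that collects the two classes separately and pairs them by position. The `parent` wrapper class and the `naive_pairs` name are assumptions added for illustration; only the three rows come from the markup above.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the "parent" wrapper is an assumption.
html = """
<div class="parent">
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div> <div class="class2">12</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Two independent lookups: the link between a name and its value is lost.
names = [d.string for d in soup.find_all("div", attrs={"class": "class1"})]
values = [d.string for d in soup.find_all("div", attrs={"class": "class2"})]

# Pairing by position attaches 12 to Jane and drops Joe entirely.
naive_pairs = list(zip(names, values))
print(naive_pairs)
```

Because the second list is shorter than the first, positional pairing cannot tell which person is missing a value.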
Wanted result (in variables):
Jill 50
Jane
Joe 12
Upvotes: 0
Views: 137
Reputation: 21
This is what I finally used. It works for multiple values and for spaces inside class names.
# default values for vars
Item1 = Item2 = Item3 = ""
for item in soup.find_all('div'):
    # convert to str so the class can be found by substring search
    strItem = str(item)
    if strItem.find("class1") > 0 and item.string is not None:
        if Item1 != "":  # if you have None as default change this
            print(Item1, Item2, Item3)  # or make list, dict, json, csv, sql......
            Item2 = Item3 = ""  # reset to default values for the next record
        Item1 = item.string
    elif strItem.find("class2") > 0 and item.string is not None:
        Item2 = item.string
    elif strItem.find("class3") > 0 and item.string is not None:
        Item3 = item.string
    # and so on....
# don't forget to process the last one...
print(Item1, Item2, Item3)  # or make list, dict, json, csv, sql......
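As a variation on the same single-pass idea, the sketch below collects the rows into a list of dicts instead of printing them, and checks the parsed class list rather than searching the tag's string form. The sample markup, the `parent` wrapper, and the `records` name are assumptions for illustration, not part of the code above.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the "parent" wrapper is an assumption.
html = """
<div class="parent">
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div> <div class="class2">12</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for div in soup.find_all("div"):
    if div.string is None:
        continue  # wrapper divs with several children have no single string
    classes = div.get("class", [])  # multi-valued attribute, parsed as a list
    if "class1" in classes:
        # a name starts a new record with an empty default value
        records.append({"name": str(div.string), "value": ""})
    elif "class2" in classes and records:
        # a value belongs to the most recently seen name
        records[-1]["value"] = str(div.string)

print(records)
```

Appending a record as soon as a name is seen removes the need for a separate step to flush the last row after the loop.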
Upvotes: 0
Reputation: 198
You could get all name ('class1') elements and check whether each one has a corresponding age ('class2') element.
from bs4 import BeautifulSoup
html = """
<div class='parent'>
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div> <div class="class2">12</div>
</div>
"""
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
name_tags = soup.find_all('div', {'class': 'class1'})
name_age_pairs = []
# Iterate through all 'class1' elements and see if the next div in document order is 'class2'
for name_tag in name_tags:
    name_next_div = name_tag.find_next('div')
    age = None
    # find_next returns None when no div follows, and a div may have no class attribute
    if name_next_div is not None and 'class2' in name_next_div.get('class', []):
        age = int(name_next_div.string)
    name_age_pairs.append((name_tag.string, age))
print(name_age_pairs)
name_age_pairs will contain:
[('Jill', 50), ('Jane', None), ('Joe', 12)]
Here None means that no age is associated with the second person (Jane).
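One caveat: find_next walks the whole document in order, so a name at the end of its container could pick up a 'class2' div that belongs to something else further down the page. A possible variation, sketched here under that assumption, uses find_next_sibling to stay inside the same parent. The helper name `pair_names_with_ages` and the trailing stray div are made up for illustration.

```python
from bs4 import BeautifulSoup

# Joe has no age here; the stray class2 div outside the parent is an
# assumption added to show the difference from find_next.
html = """
<div class="parent">
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div>
</div>
<div class="class2">99</div>
"""


def pair_names_with_ages(soup):
    """Pair each class1 div with the class2 div that directly follows it as a sibling."""
    pairs = []
    for name_tag in soup.find_all("div", class_="class1"):
        # Only siblings under the same parent are considered.
        sibling = name_tag.find_next_sibling("div")
        age = None
        if sibling is not None and "class2" in sibling.get("class", []):
            age = int(sibling.get_text(strip=True))
        pairs.append((name_tag.get_text(strip=True), age))
    return pairs


name_age_pairs = pair_names_with_ages(BeautifulSoup(html, "html.parser"))
print(name_age_pairs)
```

With find_next, Joe would have been paired with the unrelated 99; restricting the search to siblings leaves his age as None.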
Upvotes: 1
Reputation: 1122
Try this:
pairs = []
for div in soup.find_all('div', {'class': 'class1'}):
    name = div.text
    item = ''
    tmp = div.find_next('div')
    # tmp is None when no div follows the last name
    if tmp is not None and 'class2' in tmp.get('class', []):
        item = tmp.text
    pairs.append([name, item])
Upvotes: 0