Reputation: 5
I can't figure out how to get the second inner div of the outer div with bs4. I need the div that contains the date. Thanks for helping.
Here's the HTML:
<div class="featured-item-meta">
<div><strong>Published:</strong></div>
<div>October 14, 2015</div>
<ul class="creatorList">
<li>
<div><strong>Writer:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite Bennett</a></div>
</li>
<li>
<div><strong>Cover Artist:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge Molina</a></div>
</li>
</ul>
</div>
Upvotes: 0
Views: 72
Reputation: 764
Here is a workaround:
from bs4 import BeautifulSoup
text = '<div class="featured-item-meta">\
<div><strong>Published:</strong></div>\
<div>October 14, 2015</div>\
<ul class="creatorList">\
<li>\
<div><strong>Writer:</strong></div>\
<div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite Bennett</a></div>\
</li>\
<li>\
<div><strong>Cover Artist:</strong></div>\
<div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge Molina</a></div>\
</li>\
</ul>\
</div>'
soup = BeautifulSoup(text, 'html.parser')
# Index 1 is the second div found inside the container, which holds the date.
print(soup.find('div', attrs={'class': 'featured-item-meta'})\
.find_all('div')[1].text)
Output:
October 14, 2015
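Note that the chained call above raises AttributeError if the featured-item-meta container is missing, and IndexError if it holds fewer than two div elements. Here is a minimal sketch of a more defensive version. The helper name get_published_date and the shortened sample string are assumptions made for illustration, not part of bs4 or of the original page:

```python
from bs4 import BeautifulSoup


def get_published_date(html):
    """Return the text of the second direct child <div> of the
    featured-item-meta container, or None if it cannot be found.

    Hypothetical helper, written only for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="featured-item-meta")
    if container is None:
        return None
    # recursive=False limits the search to direct children, so the
    # divs nested inside the <ul> are never counted.
    divs = container.find_all("div", recursive=False)
    if len(divs) < 2:
        return None
    return divs[1].get_text(strip=True)


# Shortened sample, assumed to mirror the structure in the question.
sample = (
    '<div class="featured-item-meta">'
    "<div><strong>Published:</strong></div>"
    "<div>October 14, 2015</div>"
    "</div>"
)

print(get_published_date(sample))                     # October 14, 2015
print(get_published_date("<p>no metadata here</p>"))  # None
```

Returning None instead of raising lets the caller decide how to handle pages where the block is absent.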
Upvotes: 0
Reputation: 3519
from bs4 import BeautifulSoup as bsp
s = '''
<div class="featured-item-meta">
<div><strong>Published:</strong></div>
<div>October 14, 2015</div>
<ul class="creatorList">
<li>
<div><strong>Writer:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite Bennett</a></div>
</li>
<li>
<div><strong>Cover Artist:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge Molina</a></div>
</li>
</ul>
</div>
'''
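# Name the parser explicitly; bsp(s) alone emits a "no parser was explicitly specified" warning.
# find_all is the current name for the legacy findChildren alias.
print(bsp(s, 'html.parser').find('div').find_all('div')[1])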
Upvotes: 0
Reputation: 1
It would help to see how you request the web page. I'll assume you already have the HTML as a string, called page_text here. You can then write a selector like this:
import bs4
page_text = """<div class="featured-item-meta">
<div>
<strong>Published:</strong>
</div>
<div>October 14, 2015</div>
<ul class="creatorList">
<li><div><strong>Writer:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite Bennett</a></div></li>
<li><div><strong>Cover Artist:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge Molina</a></div></li>
</ul>
</div>"""
soup = bs4.BeautifulSoup(page_text,'html.parser')
date_without_div = soup.select('div > div')[1].get_text(strip=True)
# Or, to keep the enclosing div tag:
date_with_div = soup.select('div > div')[1]
print(date_without_div)
print(date_with_div)
Output:
October 14, 2015
<div>October 14, 2015</div>
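One caveat: 'div > div' matches every div whose parent is a div, anywhere in the document, so on a full page index [1] may land on unrelated markup. Below is a sketch that scopes the selector to the container class instead. It assumes the class name featured-item-meta from the question is stable; the wrapper div and its "unrelated" children are made up to show the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical page: two unrelated divs come before the metadata block.
page_text = """<div class="wrapper"><div>unrelated</div><div>also unrelated</div>
<div class="featured-item-meta">
<div><strong>Published:</strong></div>
<div>October 14, 2015</div>
</div></div>"""

soup = BeautifulSoup(page_text, "html.parser")

# Unscoped: picks the second div-inside-a-div in document order.
unscoped = soup.select("div > div")[1].get_text(strip=True)

# Scoped: the second div child of the featured-item-meta container.
scoped = soup.select_one(
    "div.featured-item-meta > div:nth-of-type(2)"
).get_text(strip=True)

print(unscoped)  # also unrelated
print(scoped)    # October 14, 2015
```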
Upvotes: 0
Reputation: 84465
This is easy with bs4 4.7.1+. You can use :has and :contains to get the parent div which has the child strong which contains the string Published:, then use the adjacent sibling combinator to get the next div.
from bs4 import BeautifulSoup
html = '''
<div class="featured-item-meta">
<div><strong>Published:</strong></div>
<div>October 14, 2015</div>
<ul class="creatorList">
<li>
<div><strong>Writer:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite Bennett</a></div>
</li>
<li>
<div><strong>Cover Artist:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge Molina</a></div>
</li>
</ul>
</div>
'''
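# The import above binds the name BeautifulSoup, not bs. The 'lxml' parser requires the lxml package.
soup = BeautifulSoup(html, 'lxml')
# Newer soupsieve releases deprecate :contains in favor of :-soup-contains.
print(soup.select_one('div:has(strong:contains("Published:")) + div').text)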
Upvotes: 1
Reputation: 33384
Find the div whose text is Published: and then use find_next('div') to get the date.
from bs4 import BeautifulSoup
html='''<div class="featured-item-meta">
<div><strong>Published:</strong></div>
<div>October 14, 2015</div>
<ul class="creatorList">
<li>
<div><strong>Writer:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite Bennett</a></div>
</li>
<li>
<div><strong>Cover Artist:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge Molina</a></div>
</li>
</ul>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name for the older text= argument (bs4 4.4.0+).
datetext = soup.find('div', string='Published:').find_next('div').text
print(datetext)
Output:
October 14, 2015
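The same label-then-next-div idea extends to the other fields in the block. Here is a rough sketch, assuming every label sits in a strong tag inside its own div and the value is in the next sibling div, as in the question's HTML. The helper name value_for_label is hypothetical:

```python
from bs4 import BeautifulSoup

html = '''<div class="featured-item-meta">
<div><strong>Published:</strong></div>
<div>October 14, 2015</div>
<ul class="creatorList">
<li>
<div><strong>Writer:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite Bennett</a></div>
</li>
<li>
<div><strong>Cover Artist:</strong></div>
<div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge Molina</a></div>
</li>
</ul>
</div>'''


def value_for_label(soup, label):
    """Return the text of the div that follows the div holding `label`,
    or None if the label or its value div is missing.

    Hypothetical helper, written only for this example.
    """
    strong = soup.find("strong", string=label)
    if strong is None:
        return None
    # The label's <strong> sits in its own div; the value is the next
    # sibling div under the same parent.
    value_div = strong.parent.find_next_sibling("div")
    return value_div.get_text().strip() if value_div else None


soup = BeautifulSoup(html, "html.parser")
print(value_for_label(soup, "Published:"))     # October 14, 2015
print(value_for_label(soup, "Writer:"))        # G. Willow Wilson, Marguerite Bennett
print(value_for_label(soup, "Cover Artist:"))  # Jorge Molina
print(value_for_label(soup, "Penciller:"))     # None
```

find_next_sibling is used here rather than find_next so the lookup cannot wander outside the label's own parent element.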
Upvotes: 0