Yonatan Lange

Reputation: 5

How do I pick the second div from the code without any kind of identification?

I can't figure out what I need to do to get the second div inside the second div with bs4. I need the div that contains the date. Thanks for helping.

Here's the code:

<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Writer:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>
        </li>
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>

Upvotes: 0

Views: 72

Answers (5)

lagripe

Reputation: 764

Here is a workaround:

from bs4 import BeautifulSoup

text = '<div class="featured-item-meta">\
<div><strong>Published:</strong></div>\
<div>October 14, 2015</div>\
<ul class="creatorList">\
    <li>\
        <div><strong>Writer:</strong></div>\
        <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>\
    </li>\
    <li>\
        <div><strong>Cover Artist:</strong></div>\
        <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>\
    </li>\
</ul>\
</div>'

soup = BeautifulSoup(text, 'html.parser')

# find_all('div') returns every nested div in document order; index 1 is the date div
print(soup.find('div', attrs={'class': 'featured-item-meta'})
          .find_all('div')[1].text)

Output:

October 14, 2015

Documentation about bs4 here

Upvotes: 0

Poojan

Reputation: 3519

from bs4 import BeautifulSoup as bsp
s = '''
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Writer:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>
        </li>
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>
'''
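# Name the parser explicitly to avoid the "no parser was explicitly specified" warning;
# find_all is the current name for the legacy findChildren alias.
print(bsp(s, 'html.parser').find('div').find_all('div')[1])
  • The code may need slight changes depending on your full web page and its structure.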
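To make that caveat concrete: find_all('div') also walks into the divs nested under each li, so the index can shift if the real page nests things differently. Here is a minimal sketch, assuming the date is always the second direct child div of featured-item-meta. The trimmed HTML and the variable names are illustrative, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the markup from the question.
html = """
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li><div><strong>Writer:</strong></div><div>G. Willow Wilson</div></li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="featured-item-meta")

# recursive=False limits the search to direct children of the container,
# so the divs nested inside <li> elements are not counted.
direct_divs = container.find_all("div", recursive=False)
published = direct_divs[1].get_text(strip=True)
print(published)
```

With recursive=False the list holds only the label div and the date div, so index 1 stays stable even if more creator entries are added to the list.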

Upvotes: 0

JustPC

Reputation: 1

It would help to see how you request the web page. I assume you have your own way of doing that and end up with the HTML as a string, called page_text here. To illustrate the idea, you can write a selector like this:

import bs4
page_text = """<div class="featured-item-meta">
         <div>
           <strong>Published:</strong>
         </div>
         <div>October 14, 2015</div>
         <ul class="creatorList">
             <li><div><strong>Writer:</strong></div>
                 <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div></li>
             <li><div><strong>Cover Artist:</strong></div>
                 <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div></li>
        </ul>
       </div>"""

soup = bs4.BeautifulSoup(page_text,'html.parser')

date_without_div = soup.select('div > div')[1].get_text(strip=True)
#Or
date_with_div = soup.select('div > div')[1]

print(date_without_div)
print(date_with_div)

Output:

October 14, 2015
<div>October 14, 2015</div>
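On a full page, 'div > div' will probably match many more elements, so the [1] index may stop pointing at the date. A possible variation, assuming the featured-item-meta class is present on the real page as it is in the question, anchors the selector to that class and uses :nth-of-type(2). The trimmed page_text below is illustrative only.

```python
import bs4

# Trimmed, illustrative copy of the markup from the question.
page_text = """<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li><div><strong>Writer:</strong></div><div>G. Willow Wilson</div></li>
    </ul>
</div>"""

soup = bs4.BeautifulSoup(page_text, "html.parser")

# Second <div> among the direct children of the featured-item-meta container.
date_div = soup.select_one("div.featured-item-meta > div:nth-of-type(2)")

# select_one returns None when nothing matches, so guard before reading text.
date_text = date_div.get_text(strip=True) if date_div is not None else None
print(date_text)
```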

Upvotes: 0

QHarr

Reputation: 84465

This is easy with bs4 4.7.1+. You can use :has and :contains to select the div that has a child strong containing the string Published:, then use the adjacent sibling combinator (+) to get the next div.

from bs4 import BeautifulSoup

html = '''
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Writer:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>
        </li>
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')  # requires the lxml package; 'html.parser' also works here
print(soup.select_one('div:has(strong:contains("Published:")) + div').text)
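Newer Soup Sieve releases (2.1 and later, as far as I know) deprecate :contains in favor of :-soup-contains. A sketch of the same idea with the newer spelling, assuming such a version is installed, could look like this. It also uses :has(> strong...) so that only a div whose direct child is the label matches. The trimmed HTML is illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the markup from the question.
html = '''
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li><div><strong>Writer:</strong></div><div>G. Willow Wilson</div></li>
    </ul>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Assumes Soup Sieve 2.1+, where :-soup-contains replaces the deprecated :contains.
# "+ div" selects the div immediately following the label div.
match = soup.select_one('div:has(> strong:-soup-contains("Published:")) + div')
published = match.get_text(strip=True) if match is not None else None
print(published)
```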

Upvotes: 1

KunduK

Reputation: 33384

Find the div whose text is Published:, then use find_next('div') to get the div that holds the date.

from bs4 import BeautifulSoup
html='''<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Writer:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>
        </li>
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# string= is the current name for the older text= argument (bs4 4.4.0+)
datetext = soup.find('div', string='Published:').find_next('div').text
print(datetext)

Output:

October 14, 2015
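If the label might be missing on some pages, the chained call above raises AttributeError because find returns None. A small defensive sketch follows. The helper name text_after_label is hypothetical, and it assumes the value always sits in the div right after the label's parent div.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the markup from the question.
html = '''<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
</div>'''


def text_after_label(soup, label):
    """Return the text of the div following the label's div, or None if absent."""
    strong = soup.find("strong", string=label)
    if strong is None or strong.parent is None:
        return None
    # The label sits inside its own div; the value is that div's next sibling div.
    sibling = strong.parent.find_next_sibling("div")
    return sibling.get_text(strip=True) if sibling is not None else None


soup = BeautifulSoup(html, "html.parser")
print(text_after_label(soup, "Published:"))
print(text_after_label(soup, "Missing:"))
```

The second call prints None instead of raising, which makes it easier to loop over pages where a field is absent.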

Upvotes: 0
