Sam Chahine

Reputation: 620

I’d like to scrape the text in nested elements within multiple divs

Using BeautifulSoup4 with Python 3, I’d like to scrape the text in nested elements within divs. But first, I want to extract links that are also nested in elements within divs.

How would I go about grabbing a link LINK-I-WANT.COM and an image IMAGE-I-WANT.JPG nested in something like this:

<section class="LINK_CLASS">
    <div class="LINK_CLASS2">
        <div class="LINK_CLASS3">
            <span class="#">random text</span>
            <a href="LINK-I-WANT.COM">
                <img  src="IMAGE-I-WANT.JPG" class="IMG_CLASS"/>
            </a>
        </div>
    </div>
</section>

All the scraped links would then be saved to a list, and the script would go through each link and find something along the lines of:

<div class="CLASS_ONE">
    <div class="CLASS_TWO">
      <ul>
        <li><span>FOO</span>BAR</li>
        <li><span>FOO2</span>BAR2</li>
        <li><span>FOO3</span>BAR3</li>
        <li><span>FOO4</span>BAR4</li>
      </ul>
    </div>
</div>

Using the example above, how would I access the FOO# and BAR# so that when I loop through every link and find the information that each page has (FOO# & BAR#), I can print it to the generated text file, for every link?

Do forgive me if I am not making sense. Here is my attempt at the code; I would greatly appreciate any help.

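import requests
from bs4 import BeautifulSoup


def spider(max_pages):
    page = 1
    subs = []  # will hold the scraped links
    print("Getting links...")
    while page <= max_pages:
        url = "http://example.com"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.find_all("section", {"class": "LINK_CLASS"}):
            pass  # stuck here: how do I reach the nested <a href>?
        page += 1  # move on so the while loop terminates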

This is the part where I get stuck. If the <a> tag had a class, this would be a lot easier; unfortunately, the <a> tag only has an href, so I have to reach it through the surrounding elements. I don't know how to look for an element within an element. Could someone please help me?

Upvotes: 1

Views: 1904

Answers (1)

alecxe

Reputation: 473863

There are multiple ways to locate the desired links in this case. I would use a CSS selector:

for link in soup.select("section.LINK_CLASS > div.LINK_CLASS2 > div.LINK_CLASS3 > a[href]"):
    print(link["href"])

In the selector, . matches on a class name and > requires a direct parent-child relationship. In other words, we are locating the a elements that have an href attribute and sit directly under the div element with the LINK_CLASS3 class, which sits directly under the div element with the LINK_CLASS2 class, which in turn sits directly inside the section element with the LINK_CLASS class.
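For a fuller picture, here is a minimal sketch that runs the selector above against the sample markup from the question, picks up the image inside each link as well, and then pulls the FOO#/BAR# pairs out of the second snippet. The function names extract_links and extract_pairs are made up for illustration, and the way the BAR# text is read (collecting the text nodes that are direct children of each li) is just one possible approach, not the only one. The HTML is inlined so the example runs without a network request; on a real site it would come from requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (shortened to two list items).
LISTING_HTML = """
<section class="LINK_CLASS">
    <div class="LINK_CLASS2">
        <div class="LINK_CLASS3">
            <span class="#">random text</span>
            <a href="LINK-I-WANT.COM">
                <img src="IMAGE-I-WANT.JPG" class="IMG_CLASS"/>
            </a>
        </div>
    </div>
</section>
"""

DETAIL_HTML = """
<div class="CLASS_ONE">
    <div class="CLASS_TWO">
      <ul>
        <li><span>FOO</span>BAR</li>
        <li><span>FOO2</span>BAR2</li>
      </ul>
    </div>
</div>
"""


def extract_links(html):
    """Return (href, img_src) pairs; img_src is None if the link has no image."""
    soup = BeautifulSoup(html, "html.parser")
    selector = "section.LINK_CLASS > div.LINK_CLASS2 > div.LINK_CLASS3 > a[href]"
    results = []
    for link in soup.select(selector):
        img = link.find("img")  # searches only inside this <a> element
        results.append((link["href"], img.get("src") if img else None))
    return results


def extract_pairs(html):
    """Return (label, value) pairs from each <li><span>label</span>value</li>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item in soup.select("div.CLASS_ONE > div.CLASS_TWO > ul > li"):
        span = item.find("span")
        if span is None:
            continue  # skip list items that do not follow the expected shape
        label = span.get_text(strip=True)
        # Text nodes that are direct children of the <li>, i.e. outside the <span>.
        value = "".join(item.find_all(string=True, recursive=False)).strip()
        pairs.append((label, value))
    return pairs


print(extract_links(LISTING_HTML))
print(extract_pairs(DETAIL_HTML))
```

In the spider from the question, the hrefs returned by extract_links could be appended to subs, and each one could then be fetched and passed to extract_pairs before writing the pairs to the text file.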

Upvotes: 1
