Reputation: 620
Using BeautifulSoup4 with Python 3, I’d like to scrape the text in nested elements within divs. But first, I want to extract links that are also nested in elements within divs.
How would I go about grabbing a link (LINK-I-WANT.COM) and an image (IMAGE-I-WANT.JPG) nested in something like this:
<section class="LINK_CLASS">
  <div class="LINK_CLASS2">
    <div class="LINK_CLASS3">
      <span class="#">random text</span>
      <a href="LINK-I-WANT.COM">
        <img src="IMAGE-I-WANT.JPG" class="IMG_CLASS"/>
      </a>
    </div>
  </div>
</section>
All the scraped links would then be saved to a list, and the script would go through each link and find something along the lines of:
<div class="CLASS_ONE">
  <div class="CLASS_TWO">
    <ul>
      <li><span>FOO</span>BAR</li>
      <li><span>FOO2</span>BAR2</li>
      <li><span>FOO3</span>BAR3</li>
      <li><span>FOO4</span>BAR4</li>
    </ul>
  </div>
</div>
Using the example above, how would I access FOO# and BAR# so that, when I loop through every link and find the information each page has (FOO# and BAR#), I can write it to the generated text file for every link?
Forgive me if I am not making sense. Here is my attempt at the code; I would greatly appreciate any help.
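import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    subs = []  # collected links
    print("Getting links...")
    while page <= max_pages:
        url = "http://example.com"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("section", {"class": "LINK_CLASS"}):
            pass  # stuck here: how to reach the nested <a> and <img>?
        page += 1  # advance, otherwise the loop never ends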
This is the part where I get stuck. If the <a> tag had a class, this would be a lot easier; unfortunately, the <a> tag only has an href, so I have to reach it through the surrounding elements. I don't know how to look for an element within an element. Could someone please help me?
Upvotes: 1
Views: 1904
Reputation: 473863
There are multiple ways to locate the desired links in this case. I would make a CSS selector:
for link in soup.select("section.LINK_CLASS > div.LINK_CLASS2 > div.LINK_CLASS3 > a[href]"):
    print(link["href"])
Here, the dot (.) checks for the presence of a class, and > checks for a direct parent-child relationship. In other words, we are locating the a elements that have an href attribute and sit directly under the div element with the LINK_CLASS3 class, which sits directly under the div element with the LINK_CLASS2 class, which in turn sits directly inside the section element with the LINK_CLASS class.
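To make this concrete, here is a minimal sketch that runs the selector against the sample markup from the question and also picks up the image and the FOO/BAR pairs. The helper names (extract_links, extract_pairs) and the constants are made up for illustration, and the class names are the question's placeholders; treat it as one possible approach under those assumptions rather than the only way to do it.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question; the class names are placeholders.
LISTING_HTML = """
<section class="LINK_CLASS">
  <div class="LINK_CLASS2">
    <div class="LINK_CLASS3">
      <span class="#">random text</span>
      <a href="LINK-I-WANT.COM">
        <img src="IMAGE-I-WANT.JPG" class="IMG_CLASS"/>
      </a>
    </div>
  </div>
</section>
"""

DETAIL_HTML = """
<div class="CLASS_ONE">
  <div class="CLASS_TWO">
    <ul>
      <li><span>FOO</span>BAR</li>
      <li><span>FOO2</span>BAR2</li>
    </ul>
  </div>
</div>
"""

LINK_SELECTOR = "section.LINK_CLASS > div.LINK_CLASS2 > div.LINK_CLASS3 > a[href]"


def extract_links(html):
    """Return a (href, image src or None) tuple for each matching <a>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.select(LINK_SELECTOR):
        img = link.find("img")  # searches only inside this <a>
        results.append((link["href"], img.get("src") if img else None))
    return results


def extract_pairs(html):
    """Return a (span text, remaining li text) tuple for each <li>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item in soup.select("div.CLASS_ONE > div.CLASS_TWO > ul > li"):
        span = item.find("span")
        if span is None:
            continue  # skip list items that do not follow the FOO/BAR shape
        label = span.get_text(strip=True)
        # Keep only text nodes that are direct children of the <li>,
        # i.e. the text that sits outside the <span>.
        value = "".join(
            child for child in item.children if isinstance(child, NavigableString)
        ).strip()
        pairs.append((label, value))
    return pairs


print(extract_links(LISTING_HTML))
print(extract_pairs(DETAIL_HTML))
```

In a real crawl, the hrefs returned by extract_links would be collected into a list, each one fetched with requests.get, and the page text passed to extract_pairs before writing the pairs to the output file.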
Upvotes: 1