Reputation: 329
So, let's say I have a page like this inside the <body> tag:
<!-- Tag <a> with <img> inside of it -->
<div class="album_item">
<a href="http://www.foo.com/img/1"><img src="http://thumbnail.foo.com/img/1.jpg" /></a>
<a href="http://www.foo.com/img/2"><img src="http://thumbnail.foo.com/img/2.jpg" /></a>
<a href="http://www.foo.com/img/3"><img src="http://thumbnail.foo.com/img/3.jpg" /></a>
<a href="http://www.foo.com/img/4"><img src="http://thumbnail.foo.com/img/4.jpg" /></a>
</div>
<!-- Only tag <img> -->
<div class="album_item">
<img src="http://large.foo.com/img/5.jpg" />
<img src="http://large.foo.com/img/6.jpg" />
</div>
<!-- Combination Of Both Above -->
<div class="album_item">
<a href="http://www.foo.com/img/7"><img src="http://thumbnail.foo.com/img/7.jpg" /></a>
<a href="http://www.foo.com/img/8"><img src="http://thumbnail.foo.com/img/8.jpg" /></a>
<a href="http://www.foo.com/img/9"><img src="http://thumbnail.foo.com/img/9.jpg" /></a>
<a href="http://www.foo.com/img/10"><img src="http://thumbnail.foo.com/img/10.jpg" /></a>
<img src="http://large.foo.com/img/11.jpg" />
<img src="http://large.foo.com/img/12.jpg" />
</div>
And I want to scrape it using the code below:
import requests
from bs4 import BeautifulSoup as soup

my_url = 'http://www.foo-url.com'
uClient = requests.get(my_url)
page_html = uClient.text
uClient.close()
page_soup = soup(page_html, "html.parser")

# Identify each post group (the class name must match the HTML: album_item)
containers = page_soup.findAll("div", {"class": "album_item"})
data = []
for container in containers:
    # Store each linked picture
    items = container.findAll("a")
    for item in items:
        # Set the link location
        link_location = item.attrs['href']
        image_item = item.find("img")
        # Set the image location
        img_location = image_item.attrs['src']
        data.append((link_location, img_location))
    # Just in case there is only an image
    imgs = container.findAll("img")
    for img in imgs:
        link_location = "NoLink"
        img_location = img.attrs['src']
        data.append((link_location, img_location))

for link_location, img_location in data:
    print(link_location + " | " + img_location)
And in the result, there are a lot of duplicates, like this:
http://www.foo.com/img/1 | http://thumbnail.foo.com/img/1.jpg
http://www.foo.com/img/2 | http://thumbnail.foo.com/img/2.jpg
http://www.foo.com/img/3 | http://thumbnail.foo.com/img/3.jpg
http://www.foo.com/img/4 | http://thumbnail.foo.com/img/4.jpg
NoLink | http://thumbnail.foo.com/img/1.jpg #duplicate
NoLink | http://thumbnail.foo.com/img/2.jpg #duplicate
NoLink | http://thumbnail.foo.com/img/3.jpg #duplicate
NoLink | http://thumbnail.foo.com/img/4.jpg #duplicate
NoLink | http://large.foo.com/img/5.jpg
NoLink | http://large.foo.com/img/6.jpg
http://www.foo.com/img/7 | http://thumbnail.foo.com/img/7.jpg
http://www.foo.com/img/8 | http://thumbnail.foo.com/img/8.jpg
http://www.foo.com/img/9 | http://thumbnail.foo.com/img/9.jpg
http://www.foo.com/img/10 | http://thumbnail.foo.com/img/10.jpg
NoLink | http://thumbnail.foo.com/img/7.jpg #duplicate
NoLink | http://thumbnail.foo.com/img/8.jpg #duplicate
NoLink | http://thumbnail.foo.com/img/9.jpg #duplicate
NoLink | http://thumbnail.foo.com/img/10.jpg #duplicate
NoLink | http://large.foo.com/img/11.jpg
NoLink | http://large.foo.com/img/12.jpg
My idea is to check inside each <div class="album_item">: if all of the children are <a> tags, then run the for item in items: loop; else if all of the children are <img> tags, then run the for img in imgs: loop. But then what if there are both kinds of tag? And I am not sure how to check for that either.
On the first <div> I tried to use if(container.select("img")), which should be false, but the value is true because it detects the <img> tag that is inside the <a> tag.
So, how should I approach this?
Upvotes: 2
Views: 1936
Reputation: 7238
What you want is tag.find_all(recursive=False).
From the documentation:
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False.
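To see the difference on a small fragment, here is a minimal sketch. The HTML string and the variable names are made up for illustration and are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <img> wrapped in an <a>, one <img> as a direct child.
html = (
    '<div class="album_item">'
    '<a href="/img/1"><img src="1.jpg" /></a>'
    '<img src="5.jpg" />'
    '</div>'
)
div = BeautifulSoup(html, "html.parser").find("div", {"class": "album_item"})

# Default (recursive=True): every descendant <img> is returned,
# including the one nested inside the <a>.
all_imgs = [img["src"] for img in div.find_all("img")]

# recursive=False: only <img> tags that are direct children of the <div>.
direct_imgs = [img["src"] for img in div.find_all("img", recursive=False)]

print(all_imgs)
print(direct_imgs)
```

The first list contains both images, while the second contains only the one that sits directly under the <div>.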
In your code, change this line
imgs = container.findAll("img")
to
imgs = container.findAll("img", recursive=False)
Output:
http://www.foo.com/img/1 | http://thumbnail.foo.com/img/1.jpg
http://www.foo.com/img/2 | http://thumbnail.foo.com/img/2.jpg
http://www.foo.com/img/3 | http://thumbnail.foo.com/img/3.jpg
http://www.foo.com/img/4 | http://thumbnail.foo.com/img/4.jpg
NoLink | http://large.foo.com/img/5.jpg
NoLink | http://large.foo.com/img/6.jpg
http://www.foo.com/img/7 | http://thumbnail.foo.com/img/7.jpg
http://www.foo.com/img/8 | http://thumbnail.foo.com/img/8.jpg
http://www.foo.com/img/9 | http://thumbnail.foo.com/img/9.jpg
http://www.foo.com/img/10 | http://thumbnail.foo.com/img/10.jpg
NoLink | http://large.foo.com/img/11.jpg
NoLink | http://large.foo.com/img/12.jpg
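If you would rather handle the mixed case in a single pass, one possible variation is to walk only the direct children of each <div> and branch on the tag name. This is a sketch under assumptions, not the only way to do it: the function name extract_pairs and the sample string are hypothetical, and it assumes every wanted <a> or <img> is a direct child of the album_item <div>.

```python
from bs4 import BeautifulSoup


def extract_pairs(page_html):
    """Return (link, image) tuples for every album_item <div>, in document order."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    data = []
    for container in page_soup.find_all("div", {"class": "album_item"}):
        # Look only at direct children, so an <img> nested in an <a> is not seen twice.
        for child in container.find_all(["a", "img"], recursive=False):
            if child.name == "a":
                img = child.find("img")
                if img is not None:  # skip links that wrap no image
                    data.append((child.get("href"), img.get("src")))
            else:
                data.append(("NoLink", child.get("src")))
    return data


# Hypothetical sample: one linked thumbnail and one bare image in the same <div>.
sample = """
<div class="album_item">
<a href="http://www.foo.com/img/7"><img src="http://thumbnail.foo.com/img/7.jpg" /></a>
<img src="http://large.foo.com/img/11.jpg" />
</div>
"""
pairs = extract_pairs(sample)
for link_location, img_location in pairs:
    print(link_location + " | " + img_location)
```

Passing a list of tag names to find_all keeps the results in document order, so linked and bare images come out in the same sequence as they appear on the page.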
Upvotes: 2