Reputation: 18800
I want to extract:
image
tag anddiv
class dataI successfully manage to extract the img src, but am having trouble extracting the text from the anchor tag.
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
Here is the link for the entire HTML page.
Here is my code:
for div in soup.findAll('div', attrs={'class':'image'}):
print "\n"
for data in div.findNextSibling('div', attrs={'class':'data'}):
for a in data.findAll('a', attrs={'class':'title'}):
print a.text
for img in div.findAll('img'):
print img['src']
What I am trying to do is extract the image src (link) and the title inside the div class=data
, so for example:
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
should extract:
Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)
Upvotes: 56
Views: 178473
Reputation: 871
To get the href out of an anchor tag use tag.get("href")
and to get the img src you use tag.img.get("src")
.
Example, using this data:
data = """
<div class="image">
<a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
<a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
</div>
"""
Get the links and texts:
import requests
from bs4 import BeautifulSoup
def get_soup(url):
response = requests.get(url)
if response.ok:
return BeautifulSoup(response.text, features="html.parser")
def get_links(soup):
links = []
for tag in soup.findAll("a", href=True):
if img := tag.img:
img = img.get("src")
links.append(dict(url=tag.get("href"), text=tag.text, img=img))
return links
# soup = get_soup('www.example.com')
soup = BeautifulSoup(data, features="html.parser")
links = get_links(soup)
Outputs:
[{'url': 'http://www.example.com/eg1', 'text': 'Content1', 'img': 'http://image.example.com/img1.jpg'},
{'url': 'http://www.example.com/eg2', 'text': 'Content2 ', 'img': 'http://image.example.com/img2.jpg'}]
Upvotes: 3
Reputation: 747
print(link_addres.contents[0])
It will print the context of the anchor tags
example:
statement_title = statement.find('h2',class_='briefing-statement__title')
statement_title_text = statement_title.a.contents[0]
Upvotes: 1
Reputation: 10923
This will help:
from bs4 import BeautifulSoup
data = '''<div class="image">
<a href="http://www.example.com/eg1">Content1<img
src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
<a href="http://www.example.com/eg2">Content2<img
src="http://image.example.com/img2.jpg" /> </a>
</div>'''
soup = BeautifulSoup(data)
for div in soup.findAll('div', attrs={'class':'image'}):
print(div.find('a')['href'])
print(div.find('a').contents[0])
print(div.find('img')['src'])
If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.
Upvotes: 86
Reputation: 2394
In my case, it worked like that:
from BeautifulSoup import BeautifulSoup as bs
url="http://blabla.com"
soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
print link.string
Hope it helps!
Upvotes: 34
Reputation: 2606
I would suggest going the lxml route and using xpath.
from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')
Upvotes: 8
Reputation: 18800
All the above answers really help me to construct my answer, because of this I voted for all the answers that other users put it out: But I finally put together my own answer to exact problem I was dealing with:
As question clearly defined I had to access some of the siblings and its children in a dom structure: This solution will iterate over the images in the dom structure and construct image name using product title and save the image to the local directory.
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests
def getImages(url):
#Download the images
r = requests.get(url)
html = r.text
soup = bs(html)
output_folder = '~/amazon'
#extracting the images that in div(s)
for div in soup.findAll('div', attrs={'class':'image'}):
modified_file_name = None
try:
#getting the data div using findNext
nextDiv = div.findNext('div', attrs={'class':'data'})
#use findNext again on previous object to get to the anchor tag
fileName = nextDiv.findNext('a').text
modified_file_name = fileName.replace(' ','-') + '.jpg'
except TypeError:
print 'skip'
imageUrl = div.find('img')['src']
outputPath = os.path.join(output_folder, modified_file_name)
urlretrieve(imageUrl, outputPath)
if __name__=='__main__':
url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
getImages(url)
Upvotes: 7
Reputation: 142106
>>> txt = '<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> '
>>> fragment = bs4.BeautifulSoup(txt)
>>> fragment
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
>>> fragment.find('a', {'class': 'title'})
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
>>> fragment.find('a', {'class': 'title'}).string
u'Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)'
Upvotes: 1