Reputation: 103
I have this code:
import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
theurl= 'http://es.ninemanga.com/chapter/Dragon%20Ball%20Multiverse/279006.html'
req = Request(theurl + '.html', headers={'User-Agent': 'Mozilla/5.0'})
thepage = urlopen(req).read()
soup = BeautifulSoup(thepage, "html.parser")
for divs in soup.findAll('div', {"class": "pic_box"}):
    temp = divs.find('img', {"id" : "manga_pic_1"})
    temp1 = temp.get('src')
    print(temp1 + "\n")
I want to get all div tags with the class pic_box, and inside them all the img tags and their src. I have done this correctly with soup.findAll('div', {"class": "pic_box"}) and then temp.get('src'), but somehow I get:
http://a8.ninemanga.com/es_manga/43/555/279006/4c58c372ca4561627e5a01f6c841290e.jpg
instead of:
https://c5.ninemanga.com/es_manga/43/555/279006/939559ac8d7af80cf6b4ead0ada4f718.jpg
Are they somehow blocking my request or am I doing something wrong here?
The link in the theurl variable is the page from which I want to extract the src.
Upvotes: 1
Views: 81
Reputation: 755
It looks like the site can detect scraping requests and block them. Even a fake User-Agent does not work (I tried). Try something like Selenium, which automates browser activity, so the image is downloaded through the browser itself.
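Before concluding that the request is blocked, it may be worth ruling out a simpler difference: the URL in the question already ends in .html, and the code appends '.html' again, so the page actually requested is not the one opened in the browser. Whether that explains the different image is not confirmed here. A minimal sketch of building the request without doubling the suffix follows; the helper name build_request is hypothetical, and nothing is sent over the network.

```python
from urllib.request import Request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request for url (hypothetical helper, not from the question)."""
    # Append ".html" only when the URL does not already end with it,
    # so "279006.html" does not become "279006.html.html".
    if not url.endswith(".html"):
        url += ".html"
    return Request(url, headers={"User-Agent": user_agent})


theurl = "http://es.ninemanga.com/chapter/Dragon%20Ball%20Multiverse/279006.html"
req = build_request(theurl)

# Inspect the URL that would be fetched before calling urlopen(req).
print(req.full_url)
```

Printing req.full_url before calling urlopen(req) makes it easy to compare the fetched address with the one in the browser's address bar.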
Upvotes: 1
Reputation: 1210
The image has a unique class attribute, 'manga_pic'. Select the image by its manga_pic class.
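A minimal sketch of selecting by class with BeautifulSoup, assuming the page's img tags carry class manga_pic as stated above. The markup in SAMPLE_HTML, the example.com URLs, and the function name extract_manga_pics are made up for illustration; the real page's HTML may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div class="pic_box">
  <img id="manga_pic_1" class="manga_pic" src="https://example.com/page1.jpg">
</div>
<div class="pic_box">
  <img class="manga_pic" src="https://example.com/page2.jpg">
</div>
<div class="pic_box"><img class="banner" src="https://example.com/ad.jpg"></div>
"""


def extract_manga_pics(html):
    """Return the src of every img tag whose class list includes manga_pic."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any tag that has "manga_pic" among its classes.
    images = soup.find_all("img", class_="manga_pic")
    # Skip tags without a src attribute instead of raising an error.
    return [img["src"] for img in images if img.get("src")]


print("\n".join(extract_manga_pics(SAMPLE_HTML)))
```

Matching on the class rather than on the id manga_pic_1 returns every matching image in the page, and filtering on img.get("src") avoids the error the original loop would hit when find returns None.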
Upvotes: 0