vucko95

Reputation: 103

Python Beautiful Soup: img tag inside a div returns the wrong src link

I have this code:

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

theurl = 'http://es.ninemanga.com/chapter/Dragon%20Ball%20Multiverse/279006.html'

# Note: theurl already ends in '.html', so this requests '...279006.html.html'
req = Request(theurl + '.html', headers={'User-Agent': 'Mozilla/5.0'})
thepage = urlopen(req).read()
soup = BeautifulSoup(thepage, "html.parser")

# Print the src of the img with id "manga_pic_1" inside each pic_box div
for divs in soup.findAll('div', {"class": "pic_box"}):
    temp = divs.find('img', {"id": "manga_pic_1"})
    temp1 = temp.get('src')
    print(temp1 + "\n")

I want to get every div tag with the class pic_box, and inside each of them every img tag and its src attribute.

As far as I can tell I do this correctly with soup.findAll('div', {"class": "pic_box"}) and then temp.get('src'), but somehow I get:

http://a8.ninemanga.com/es_manga/43/555/279006/4c58c372ca4561627e5a01f6c841290e.jpg

instead of:

https://c5.ninemanga.com/es_manga/43/555/279006/939559ac8d7af80cf6b4ead0ada4f718.jpg

Are they somehow blocking my request or am I doing something wrong here?

repl to test it

referenced link in theurl variable from which I want to extract 'src'

Upvotes: 1

Views: 81

Answers (2)

cozek

Reputation: 755

It looks like they can detect scraping requests and block them. Even a fake User-Agent does not work (I tried). Try something like Selenium, which automates a real browser, so the page and its images are loaded by the browser itself.
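A minimal sketch of the Selenium approach, assuming Selenium 4 and a local Chrome install. The function name collect_image_srcs and the selector "div.pic_box img" are my own choices for illustration, and I have not verified that this gets past whatever the site does to plain urllib requests:

```python
def collect_image_srcs(url, selector="div.pic_box img"):
    """Load url in a real browser and return the src of every matching img."""
    # Imported inside the function so this snippet can be loaded
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Assumption: Chrome is installed and Selenium can start it.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        images = driver.find_elements(By.CSS_SELECTOR, selector)
        return [img.get_attribute("src") for img in images]
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

You would call it as collect_image_srcs(theurl) with the URL from the question. If the images are filled in by JavaScript after the page loads, you may also need an explicit wait before reading src.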

Upvotes: 1

JoePythonKing

Reputation: 1210

The image has a unique class attribute, 'manga_pic'. Select the image by the manga_pic class instead of by id.
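Roughly like this, as a sketch. SAMPLE_HTML is made-up markup standing in for the real page (the example.com URLs are placeholders), and manga_pic_sources is just a name I picked:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the question.
SAMPLE_HTML = """
<div class="pic_box">
  <img class="manga_pic" id="manga_pic_1" src="https://example.com/page-1.jpg">
</div>
<div class="pic_box">
  <img class="manga_pic" id="manga_pic_2" src="https://example.com/page-2.jpg">
</div>
"""


def manga_pic_sources(html):
    """Return the src of every img tag that carries the manga_pic class."""
    soup = BeautifulSoup(html, "html.parser")
    images = soup.find_all("img", class_="manga_pic")
    # Skip any img without a src instead of raising a KeyError.
    return [img["src"] for img in images if img.has_attr("src")]


print(manga_pic_sources(SAMPLE_HTML))
# ['https://example.com/page-1.jpg', 'https://example.com/page-2.jpg']
```

Matching on the class rather than on id "manga_pic_1" also means a page with several images returns all of them, not only the first.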

Upvotes: 0
