desktopp

Reputation: 105

Getting an error while web scraping a link

I am getting an error while scraping the link given below. Can anybody please help me fix the error and share code that scrapes all the text data from the page?

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link) 
webpage = urlopen(req).read()

Upvotes: 0

Views: 974

Answers (2)

lemonhead

Reputation: 5518

Setting a browser-like User-Agent header on the request, so that it looks like it comes from a browser, seems to avoid the HTTP 403: Forbidden error, e.g.:

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
webpage = urlopen(req).read()
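
As a rough sketch of the same idea with error handling added, the request can be wrapped in a small helper that reports the HTTP status instead of raising and decodes the response bytes to text. The names build_request, fetch_html and BROWSER_UA are hypothetical, and the User-Agent value is only an illustrative assumption; any common browser string may work.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumption: an illustrative browser User-Agent string, not a required value.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download url and return the decoded text, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as resp:
            # Use the charset announced by the server, falling back to UTF-8.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except HTTPError as err:
        # HTTPError must be caught before URLError, which is its parent class.
        print(f"Server refused the request: HTTP {err.code} {err.reason}")
    except URLError as err:
        print(f"Could not reach the server: {err.reason}")
    return None
```

Calling fetch_html(link) with the link above would then return the page as a str, or None after printing the reason for the failure.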

You can also see this question for a similar case.

Upvotes: 1

Jacob Lee

Reputation: 4690

You could try using requests:

>>> import requests
>>> res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
>>> res.raise_for_status()
>>> res.text
'\r\n<!DOCTYPE html><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>...'

To extract the content of the page (the actual story, in this case) from that HTML, you would likely need an HTML parsing library, such as BeautifulSoup4 or lxml.
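
If installing a third-party package is not an option, the standard library's html.parser can do a rough version of the same job. The sketch below is only an illustration, not what this answer uses: ChapterTextParser and extract_text are hypothetical names, the sample HTML is made up, and it assumes the story sits under an element with id chapter-content (taken from the selectors used in this answer) and that the markup is properly nested.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ChapterTextParser(HTMLParser):
    """Collect the text nested inside the element with a given id."""

    def __init__(self, target_id="chapter-content"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means "outside the target element"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1     # a tag nested inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entering the target element

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            if tag == "p":
                self.chunks.append("\n")    # keep paragraph breaks
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, target_id="chapter-content"):
    """Return the text found inside the element whose id is target_id."""
    parser = ChapterTextParser(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Made-up sample markup standing in for the downloaded page.
sample = (
    '<html><body><div id="nav">Menu</div>'
    '<div id="chapter-content"><p>First line.</p><p>Second <b>line</b>.</p></div>'
    '<p>Footer</p></body></html>'
)
print(extract_text(sample))
```

Real pages often contain unclosed or badly nested tags, which this depth counter does not handle; a dedicated parser is more forgiving there.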

BeautifulSoup4

import bs4
import requests

res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
soup = bs4.BeautifulSoup(res.text, features="html.parser")
elem = soup.select("#chapter-content div:nth-child(3) div")[0]
content = elem.getText()

BeautifulSoup4 and requests are third-party modules, so be sure to install them: pip install beautifulsoup4 requests

lxml

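from urllib.request import Request, urlopen
from lxml import etree

link = "https://novelfull.com/warriors-promise/chapter-1.html"
# A browser-like User-Agent may be needed; the site can answer HTTP 403 to urllib's default one
req = Request(link, headers={"User-Agent": "Mozilla/5.0"})
res = urlopen(req)
htmlparser = etree.HTMLParser()  # the class is HTMLParser, with a capital P
tree = etree.parse(res, htmlparser)
# xpath() returns a list of matching elements, so take the first match
elem = tree.xpath("//div[@id='chapter-content']//div[3]//div")[0]
# itertext() also yields text from nested tags; .text stops at the first child element
content = "".join(elem.itertext())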

lxml is a third-party module, so be sure to install it: pip install lxml

Upvotes: 1
