Reputation: 105
I get an error when scraping the link below. Can anybody help me fix the error and show me code that scrapes all the text data from the page?
from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link)
webpage = urlopen(req).read()
Upvotes: 0
Views: 974
Reputation: 5518
Setting a browser-like User-Agent header, so the request looks like it comes from a browser, seems to avoid the HTTP 403: Forbidden error, e.g.:
from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
webpage = urlopen(req).read()
You can also see this question for a similar case
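If you want to see which status code the server returns rather than just a traceback, the same idea can be wrapped in a small helper. This is only a sketch: `build_request`, `fetch_html`, and `BROWSER_UA` are names I made up for illustration, the 10-second timeout is arbitrary, and I am assuming any common browser User-Agent string is accepted.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: any common browser User-Agent works; this is the one used above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Fetch url and return the decoded page, or None on an HTTP error status."""
    try:
        with urlopen(build_request(url), timeout=timeout) as resp:
            # Fall back to UTF-8 if the server does not declare a charset.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except HTTPError as err:
        # A 403 here would mean the server still rejects the request.
        print(f"HTTP {err.code}: {err.reason}")
        return None
```

Calling `fetch_html(link)` then gives you the page as a string, or prints the status code if the server refuses the request.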
Upvotes: 1
Reputation: 4690
You could try using requests:
>>> import requests
>>> res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
>>> res.raise_for_status()
>>> res.text
'\r\n<!DOCTYPE html><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>...'
To get the content of the page (the actual story, in this case), you would likely need an HTML parsing library such as BeautifulSoup4 or lxml.
import bs4
import requests
res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
soup = bs4.BeautifulSoup(res.text, features="html.parser")
elem = soup.select("#chapter-content div:nth-child(3) div")[0]
content = elem.getText()
BeautifulSoup4 is a third-party module, so be sure to install it: pip install BeautifulSoup4
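A positional selector like the one above tends to break when the page layout changes, so it may be worth checking for a missing match. Here is a minimal sketch on an inline HTML snippet. The markup in `SAMPLE_HTML` is made up and the real page may be structured differently, and `extract_chapter_text` is a hypothetical helper name.

```python
import bs4

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <div id="chapter-content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def extract_chapter_text(html, selector="#chapter-content"):
    """Return the text under the first element matching selector, or None."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    elem = soup.select_one(selector)  # None when nothing matches
    if elem is None:
        return None
    # separator and strip put each text fragment on its own line, trimmed.
    return elem.get_text(separator="\n", strip=True)


print(extract_chapter_text(SAMPLE_HTML))
```

With the real page you would pass `res.text` instead of `SAMPLE_HTML` and adjust the selector to whatever element actually holds the story.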
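from urllib.request import Request, urlopen
from lxml import etree
# Send a browser-like User-Agent; the default urllib one may be rejected with HTTP 403
req = Request("https://novelfull.com/warriors-promise/chapter-1.html", headers={"User-Agent": "Mozilla/5.0"})
res = urlopen(req)
htmlparser = etree.HTMLParser()  # the class name has a capital P
tree = etree.parse(res, htmlparser)
# xpath() returns a list of matching elements, so take the first one
elem = tree.xpath("//div[@id='chapter-content']//div[3]//div")[0]
# itertext() also collects text from nested tags; .text alone would miss it
content = "".join(elem.itertext())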
lxml is a third-party module, so be sure to install it: pip install lxml
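If you would rather not install anything, the standard library's html.parser can also pull out all the text, though it is cruder than either library above. This is a rough sketch rather than a full solution: `TextCollector` and `html_to_text` are names I made up, and it collects every text node on the page (skipping script and style contents) instead of targeting the chapter element.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the page's text nodes joined by newlines."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "\n".join(parser.parts)
```

You would call it as `html_to_text(webpage.decode("utf-8"))` on the bytes returned by `urlopen(req).read()`, assuming the page is UTF-8 encoded.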
Upvotes: 1