Reputation:
I am looking for a way to get the HTTP status code in Selenium but can't find any code that suits my needs. My other problem is that when I enter a domain that does not exist, say https://gghgjeggeg.com, Selenium does not raise any error. Its page source looks like this:
<html><head></head><body></body></html>
How can I get the status code (for valid domains, e.g. https://twiitter.com/404errpage) and also raise an error for non-existent domains in Selenium? Or is there another library like Selenium that can do this?
Upvotes: 1
Views: 9645
Reputation: 29
For Firefox or Chrome you can use a browser extension (add-on) for this. The extension saves the status code in a cookie on the response, and we read that cookie on the Selenium side.
You can read more about browser extensions here:
Chrome: https://developer.chrome.com/extensions/getstarted
Firefox: https://developer.mozilla.org/en-US/docs/Web/Tutorials
NOTE: Unsigned add-ons work only with Firefox Developer Edition. If you want to use standard Firefox, you must have your extension signed on the Firefox site.
Chrome version
// your_js_file_with_extension.js
var targetPage = "*://*/*";

function setStatusCodeDiv(e) {
    chrome.cookies.set({
        url: e.url,
        name: 'status-code',
        value: `${e.statusCode}`
    });
}

chrome.webRequest.onCompleted.addListener(
    setStatusCodeDiv,
    {urls: [targetPage], types: ["main_frame"]}
);
Manifest:
{
    "description": "Save http status code in site cookies",
    "manifest_version": 2,
    "name": "StatusCodeInCookies",
    "version": "1.0",
    "permissions": [
        "webRequest", "*://*/*", "cookies"
    ],
    "background": {
        "scripts": [ "your_js_file_with_extension.js" ]
    }
}
The Firefox version is almost the same:
// your_js_file_with_extension.js
var targetPage = "*://*/*";

function setStatusCodeDiv(e) {
    browser.cookies.set({
        url: e.url,
        name: 'status-code',
        value: `${e.statusCode}`
    });
}

browser.webRequest.onCompleted.addListener(
    setStatusCodeDiv,
    {urls: [targetPage], types: ["main_frame"]}
);
Manifest:
{
    "description": "Save http status code in site cookies",
    "manifest_version": 2,
    "name": "StatusCodeInCookies",
    "version": "1.0",
    "permissions": [
        "webRequest", "*://*/*", "cookies"
    ],
    "background": {
        "scripts": [ "your_js_file_with_extension.js" ]
    },
    "applications": {
        "gecko": {
            "id": "some_id"
        }
    }
}
Next you must build the extensions.
For Chrome you must create the *.pem and *.crx files (PowerShell script):
start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -ArgumentList "--pack-extension=C:\Path\to\your\js\and\manifest"
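Firefox (we only need a zip file; note that the second argument is the path of the archive file to create, not a folder):
[io.compression.zipfile]::CreateFromDirectory('C:\Path\to\your\js\and\manifest', 'destination\folder\YOUR_FIREFOX_EXTENSION.zip')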
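If you are not on Windows, the zip step can probably be done from Python instead. This is only a sketch using the standard library; pack_extension and both paths are names I made up for illustration, and it assumes the archive path ends in .zip and lies outside the source folder.

```python
import shutil
from pathlib import Path


def pack_extension(source_dir, archive_path):
    """Zip the contents of source_dir so manifest.json sits at the archive root.

    Both arguments are hypothetical paths chosen by the caller.
    Returns the path of the archive that was written.
    """
    archive_path = Path(archive_path)
    # make_archive appends ".zip" itself, so pass the name without the suffix.
    base_name = archive_path.with_suffix("")
    return shutil.make_archive(str(base_name), "zip", root_dir=str(source_dir))
```

Call it as pack_extension("path/to/your/js/and/manifest", "build/YOUR_FIREFOX_EXTENSION.zip") and pass the resulting file to Selenium as shown in the next section.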
Selenium steps
OK, once we have the extension we can add it to our Selenium app. I wrote our version in C#, but I think it is easy to port to other languages (a Python version can be found here: Using Extensions with Selenium (Python)).
Load the extension with the Chrome driver:
var options = new ChromeOptions();
options.AddExtension(Path.Combine(System.Environment.CurrentDirectory,@"Selenium\BrowsersExtensions\Compiled\YOUR_CHROME_EXTENSION.crx"));
var chromeDriver = new ChromeDriver(ChromeDriverService.CreateDefaultService(), options);
Load it with Firefox (you must use a profile):
var profile = new FirefoxProfile();
profile.AddExtension(Path.Combine(System.Environment.CurrentDirectory,@"Selenium\BrowsersExtensions\Compiled\YOUR_FIREFOX_EXTENSION.zip"));
var options = new FirefoxOptions
{
Profile = profile
};
var firefoxDriver = new FirefoxDriver(FirefoxDriverService.CreateDefaultService(), options);
OK, we are almost done. Now we need to read the status code from the cookie, which should look something like this:
webDriver.Navigate().GoToUrl("your_url");
if (webDriver.Manage() is IOptions options
    && options.Cookies.GetCookieNamed("status-code") is Cookie cookie
    && int.TryParse(cookie.Value, out var statusCode))
{
    // Deleting the cookie after reading the status code is optional.
    options.Cookies.DeleteCookieNamed("status-code");
    return statusCode;
}
logger.Warn($"Can't get http status code from {webDriver.Url}");
return 500;
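For anyone on the Python bindings, the same cookie read might look roughly like the sketch below. It assumes the extension above is installed and has written a cookie named status-code; read_status_code is a name made up for this example, and the 500 fallback simply mirrors the C# snippet.

```python
def read_status_code(driver, default=500):
    """Read the status code that the extension stored in the 'status-code' cookie.

    driver is expected to be a Selenium WebDriver that has already loaded a page.
    Returns default when the cookie is missing or does not hold a number.
    """
    cookie = driver.get_cookie("status-code")  # dict, or None if not set
    if cookie is None:
        return default
    try:
        status = int(cookie["value"])
    except (KeyError, ValueError):
        return default
    # Optional cleanup so a stale value is not picked up after the next page load.
    driver.delete_cookie("status-code")
    return status
```

Typical use would be driver.get(url) followed by read_status_code(driver).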
And that is all. I have not seen an answer like this anywhere. Hope this helps.
Upvotes: 1
Reputation: 7634
Selenium is not meant to be used to directly examine HTTP status codes. Selenium is used to interact with a website the way a user would. A typical user would not open the developer tools and watch the HTTP status code, but would look at the page content.
I have even seen pages respond with HTTP 200 OK while delivering a "resource not found" message to the user.
Even the Selenium developers have addressed this:
The browser will always represent the HTTP status code, imagine for example a 404 or a 500 error page. A simple way to “fail fast” when you encounter one of these error pages is to check the page title or content of a reliable point (e.g. the <h1> tag) after every page load.
Source: selenium.dev / Worst practices / HTTP response codes
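If you insist on using Selenium, you're better off finding the first h1 element and looking for the typical signature of Chrome's error page for an unreachable site:
from selenium.webdriver.common.by import By

# find_element(By...) works in both Selenium 3 and Selenium 4
h1 = driver.find_element(By.CSS_SELECTOR, 'h1')
if h1.text == u"This site can’t be reached":
    print("Not found")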
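The "check the title or h1 after every page load" idea from the quote can be wrapped in a small helper. This is only a heuristic sketch: looks_like_error_page and the ERROR_MARKERS list are made up for this example, and the markers are guesses you would need to adapt to the browsers and sites you test.

```python
# Hypothetical markers; extend them for the browsers and sites you care about.
ERROR_MARKERS = (
    "this site can't be reached",  # Chrome, host could not be reached
    "page not found",
    "404",
)


def looks_like_error_page(title, heading):
    """Heuristically decide whether a loaded page looks like an error page."""
    # Chrome uses a typographic apostrophe, so normalize it before comparing.
    text = f"{title} {heading}".replace("\u2019", "'").lower()
    return any(marker in text for marker in ERROR_MARKERS)
```

With Selenium you would pass driver.title and the text of the first h1 element. Expect false positives, for example on a legitimate page that happens to have "404" in its title.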
Although, if you want to crawl websites, you might simply use urllib, as Tek Nath suggested in the comments:
import urllib.request
import urllib.error

try:
    with urllib.request.urlopen('http://www.safasdfsadfsadfdsf.org/') as f:
        print(f.read())
        print(f.status)
        print(f.getheader("content-length"))
except urllib.error.URLError as e:
    # Raised for unknown hosts; HTTP errors such as 404 arrive as its subclass HTTPError
    print(e.reason)
Since the domain does not exist, the code ends up in the exception handler branch.
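To cover both halves of the question (a status code for valid domains, an error for non-existent ones), the snippet above could be turned into a helper along these lines. This is a sketch, not a drop-in solution: get_status_code is a made-up name, the opener parameter exists only so the function can be exercised without network access, and the 10 second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def get_status_code(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url.

    Error responses such as 404 or 500 are returned as numbers. Failures
    below the HTTP layer (unknown host, refused connection) are not caught
    and surface as urllib.error.URLError.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # The server answered, just not with a success status.
        return e.code
```

Note that urlopen follows redirects by default, so for a redirected URL you get the status of the final response.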
See the Python documentation for urllib.request for details and more examples.
You might then want to use a DOM parser to turn the HTML markup into a DOM tree for easier processing. That is beyond the scope of this question, but you can get started here:
xml.dom (Python documentation)
Upvotes: 1