Reputation: 558
I currently have a simple webpage that consists of just a .js, a .css, and an .html file. I do not want to use any Node.js stuff.
Given these constraints, I would like to ask whether it is possible to search the content of external webpages using JavaScript (e.g. by running a web worker in the background).
For example, I would like to get the first link URL of a Google image search.
Edit:
I have now tried it and it worked fine; however, after two weeks I now get this error:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at .... (Reason: CORS header ‘Access-Control-Allow-Origin’ missing).
Any ideas how to solve this?
Here is the error as documented by Firefox: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS/Errors/CORSMissingAllowOrigin
Upvotes: 0
Views: 8038
Reputation: 894
I have heard about Python for scraping too, but Node.js + Puppeteer kicks ass... and it is pretty easy to learn.
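For instance, a minimal sketch of that approach (the DuckDuckGo HTML results URL and the a.result__a selector are my assumptions about the target page, not part of Puppeteer):
// minimal Puppeteer sketch: log the first result link of a search page
// (the URL and the "a.result__a" selector are assumptions about the target page)
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://duckduckgo.com/html/?q=stack+overflow");
  // $eval runs in the page context, so the selector sees the rendered DOM
  const href = await page.$eval("a.result__a", a => a.href);
  console.log(href);
  await browser.close();
})();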
Upvotes: 0
Reputation: 6426
Yes, this is possible. Just use the XMLHttpRequest API:
var request = new XMLHttpRequest();
request.open("GET", "https://bypasscors.herokuapp.com/api/?url=" + encodeURIComponent("https://duckduckgo.com/html/?q=stack+overflow"), true); // last parameter must be true: "document" responses require an async request
request.responseType = "document"; // let the browser parse the response as HTML
request.onload = function (e) {
  if (request.readyState === 4) {
    if (request.status === 200) {
      // grab the first result link on the DuckDuckGo HTML results page
      var a = request.responseXML.querySelector("div.result:nth-child(1) > div:nth-child(1) > h2:nth-child(1) > a:nth-child(1)");
      console.log(a.href);
      document.body.appendChild(a);
    } else {
      console.error(request.status, request.statusText);
    }
  }
};
request.onerror = function (e) {
  console.error(request.status, request.statusText);
};
request.send(null); // not a POST request, so don't send extra data
Note that I had to use a proxy to bypass CORS issues; if you want to do this, run your own proxy on your own server.
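Such a proxy only has to fetch the target URL server-side and echo the body back with a permissive CORS header. A minimal sketch, assuming Node 18+ for the built-in fetch; the /api/?url= request shape mirrors the proxy used above:
// minimal CORS proxy sketch (Node 18+ for the global fetch; run it on your own server)
const http = require("http");

http.createServer(async (req, res) => {
  // expects requests shaped like /api/?url=<encoded target URL>
  const target = new URL(req.url, "http://localhost").searchParams.get("url");
  if (!target) {
    res.writeHead(400);
    res.end("missing url parameter");
    return;
  }
  try {
    const upstream = await fetch(target);
    const body = await upstream.text();
    res.writeHead(upstream.status, {
      "Access-Control-Allow-Origin": "*", // the header the browser complained about
      "Content-Type": upstream.headers.get("content-type") || "text/html"
    });
    res.end(body);
  } catch (err) {
    res.writeHead(502);
    res.end(String(err));
  }
}).listen(8080);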
Upvotes: 2
Reputation: 3034
Yes, it is theoretically possible to do “web scraping” (i.e. parsing webpages) on the client. There are several restrictions, however, and I would question why you wouldn’t choose a program that runs on a server or desktop instead.
Web workers can request HTML content using XMLHttpRequest or fetch, and the response can then be parsed programmatically. Note that the target webpage must send the appropriate CORS headers if it belongs to a foreign domain, and that DOM parsing (e.g. DOMParser) is only available on the main thread, so a worker would post the raw text back for parsing. You could then pick out content from the resulting HTML, as in the sketch below.
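For illustration, a minimal sketch of that fetch-and-parse step on the main thread; https://example.com/page.html stands in for a page that actually sends the CORS header:
// fetch a foreign page and parse it into a queryable document (main thread only)
fetch("https://example.com/page.html")
  .then(response => response.text())
  .then(html => {
    const doc = new DOMParser().parseFromString(html, "text/html");
    const firstLink = doc.querySelector("a");
    console.log(firstLink ? firstLink.href : "no links found");
  })
  .catch(console.error); // this is where a missing CORS header shows up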
Parsing content generated with CSS and JavaScript will be harder. You will either have to construct sandboxed content on your host page from the input stream, or run some kind of parser, which doesn’t seem very feasible.
In short, the answer to your question is yes, because you have the tools to make a network request and a Turing-complete language with which to build any kind of parser and scraper you want. So technically anything is possible.
But the real question is: would it be wise? Would you ever choose this approach when other technologies are at hand? Well, no. In most cases I don’t see why you wouldn’t just write a server-side program using e.g. headless Chrome.
If you don’t want to use Node, or aren’t able to deploy it for some reason, there are many web-scraping packages and plenty of prior art in languages such as Go, C, Java, and Python. Search the package manager of your preferred programming language and you will likely find several.
Upvotes: 3