Dinossauro.Bebado

Reputation: 3

Why isn't the get method returning the proper HTML?

I'm trying to scrape my college's website, but the HTML I'm receiving is different from what the page shows in the browser.

The page : https://sistemas2.utfpr.edu.br/login

 <!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Sistemas Corporativos UTFPR</title>
  <base href="/">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="icon" type="image/x-icon" href="favicon.ico">
  <link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=Open+Sans" />
<link rel="stylesheet" href="styles.3a2559ba4017ef614370.css"></head>
<body>
  <app-root>Carregando...</app-root>
<script type="text/javascript" src="runtime.26209474bfa8dc87a77c.js"></script><script type="text/javascript" src="es2015-polyfills.36e383bb4535eafdd520.js" nomodule></script><script type="text/javascript" src="polyfills.8ecf09a1095b0f08eb97.js"></script><script type="text/javascript" src="main.5555e7ab911504c65346.js"></script></body>
</html>

My guess is that somehow the page is being redirected; I just don't know what to do about it.
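For reference, a request like the following reproduces the output above (a minimal sketch, assuming the Python requests library, since the original code isn't shown in the question):

    # Minimal reproduction sketch. Assumes the requests library;
    # the question's code was not posted, so this is an approximation.
    import requests

    url = "https://sistemas2.utfpr.edu.br/login"
    response = requests.get(url)

    # Prints the raw HTML shown above: the app shell with
    # "Carregando..." inside <app-root>, because no JavaScript has run.
    print(response.status_code)
    print(response.text)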

Upvotes: 0

Views: 62

Answers (1)

CryptoFool

Reputation: 23089

Your problem is that your code only fetches the raw HTML of the page; it doesn't go any further than that. Notice the script tags in the response: to fully render the page, the JavaScript they reference has to be downloaded and executed by a browser engine. Because your code isn't doing that, what you're seeing is the raw form of the page before any of its code has run: just the <app-root>Carregando...</app-root> placeholder that the page's JavaScript application (an Angular app, judging by the bundle names) would normally replace with the real content.
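You can see this for yourself by inspecting the response instead of just printing it. A quick sketch (assuming requests and beautifulsoup4 are installed, neither of which is named in the question):

    # Sketch: show that the fetched page is only the unrendered app shell.
    # Assumes: pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://sistemas2.utfpr.edu.br/login").text
    soup = BeautifulSoup(html, "html.parser")

    # The placeholder never gets replaced, because its scripts never run here.
    print(soup.find("app-root").get_text())   # -> Carregando...
    print(len(soup.find_all("script")))       # the JS bundles that would do the rendering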

I know a lot about this because I've been doing web scraping for many years now. There's never been any way for a Python, Java, or any other library programmer to keep up with the likes of the engineers working on Google Chrome, Firefox, or even Internet Explorer. For this reason, most advanced scraping these days is done by scripting those big boys: you run one of the real browsers, often in headless mode (i.e., with nothing appearing on any screen), and script it, so that you get exactly what you or I would see if we loaded the same page manually in our favorite browser.

And it isn't just about running JavaScript. There are tons of other things to deal with: multiple frames on the page, cookies that have to be kept track of, and lots of other stuff. It's better to leave all of that to the major browser makers. They have big teams that can keep up with the constant enhancements to the specifications that describe what browsers can and should do.

There's a really cool library called "Puppeteer" that presents a very clean API to the programmer. It drives the browser over the DevTools protocol, which is supported by Chrome and, experimentally, by Firefox, scripting those browsers and handing you back the results. That's what I'd recommend you take a look at if you want to get serious about web scraping.

This library has been ported to Python as "Pyppeteer". I don't know how good it is. Our team chose to break down and use JavaScript even though we were Python and Java programmers, because the JavaScript Puppeteer is the most advanced and bug-free implementation, and probably always will be. Let me know if you get much experience with Pyppeteer; I'd like to know how that goes.
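For what it's worth, a minimal Pyppeteer version of this kind of scrape looks roughly like the sketch below (untested against this particular site; it assumes pip install pyppeteer and that Pyppeteer can download its bundled Chromium on first run):

    # Sketch: fetch the fully rendered page with Pyppeteer (headless Chromium).
    # Assumes: pip install pyppeteer
    import asyncio
    from pyppeteer import launch

    async def fetch_rendered(url):
        browser = await launch(headless=True)
        page = await browser.newPage()
        # Wait until network activity settles so the client-side app has rendered.
        await page.goto(url, waitUntil="networkidle0")
        html = await page.content()
        await browser.close()
        return html

    html = asyncio.get_event_loop().run_until_complete(
        fetch_rendered("https://sistemas2.utfpr.edu.br/login")
    )
    print(html)  # now contains the rendered login page, not just the <app-root> placeholder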

Upvotes: 1
