Reputation: 9120
I'm trying to figure out how to get all the elements of an HTML page. For example, if I load this Google search, I see this result:
Looking at the source code for that particular section of the page, I saw this:
<a href="https://www.macworld.com/article/3331839/iphone-2019-rumors-everything-you-need-to-know.html" onmousedown="return rwt(this,'','','','38','AOvVaw07dY5FgPEzcYsd8enm-9gs','','2ahUKEwicoNi4yPjhAhVdCTQIHVxICj4QFjAlegQIABAB','','',event)">
<h3 class="LC20lb">iPhone 2019 rumors: Everything you need to know | Macworld</h3><br><div class="TbwUpd">
<cite class="iUh30">https://www.macworld.com/.../iphone-2019-rumors-everything-you-need-to-know.ht...</cite></div></a>
But if I use document.documentElement.innerHTML, I see this:
<div class="g"><h3 class="r">
<a href="/url?q=https://www.macworld.com/article/3331839/iphone-2019-rumors-everything-you-need-to-know.html&sa=U&ved=0ahUKEwiU__rUy_jhAhWIHzQIHTrGBzIQFghLMAo&usg=AOvVaw2C3PdwxIaeNuukMVSwC-5g">
<b>iPhone 2019</b> rumors: Everything you need to know | Macworld</a>
</h3><div class="s"><div class="hJND5c" style="margin-bottom:2px">
My question: why is there a difference between the source code and the output from document.documentElement.innerHTML?
Also, it looks like this when using JavaScript:
<a href="https://www.macworld.com/article/3331839/iphone-2019-rumors-everything-you-need-to-know.html" onmousedown="return rwt(this,'','','','38','AOvVaw07dY5FgPEzcYsd8enm-9gs','','2ahUKEwicoNi4yPjhAhVdCTQIHVxICj4QFjAlegQIABAB','','',event)">
<h3 class="LC20lb">iPhone 2019 rumors: Everything you need to know | Macworld</h3><br><div class="TbwUpd">
<cite class="iUh30">https://www.macworld.com/.../iphone-2019-rumors-everything-you-need-to-know.ht...</cite></div></a>
Upvotes: 4
Views: 3551
Reputation: 1044
The returned HTML or XML fragment is generated based on the current contents of the element, so the markup and formatting of the returned fragment is likely not to match the original page markup.
Upvotes: 0
Reputation: 133
To me, it looks like part of the page is generated dynamically by a client-side script, and that this script is hosted on a server other than Google's, so you might run into a CORS policy problem. As a result, document.documentElement.innerHTML will only show the static elements of the page that were originally written to be shown on the client side, leaving out the elements that the script generated dynamically.
Upvotes: 0
Reputation: 2812
I wasn't able to reproduce your problem; in my case the source code was exactly the same as document.documentElement.innerHTML. So I don't really know why you see this difference in this particular example.
Even so, the source code of a page frequently has little to do with the document's innerHTML.
innerHTML has at least two inaccuracies: it reflects whatever JavaScript has added or changed, and it reflects the markup as the browser parsed it rather than as the server sent it.
For example, here is the source code of a sample React app:
<body>
<div id="app"></div>
<script src="main.js"></script>
</body>
And here's the output it produces:
In this case, the source is completely different from the innerHTML, since we generate new content with JS.
However, it would also be different if we modified existing markup with JS, and this is probably the case with Google's results page.
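The React case above can be imitated without React. The following is only a minimal sketch: renderApp is a hypothetical stand-in for whatever main.js does, and the injected heading is made up for illustration. It shows why innerHTML reports content that "View source" never contains.

```javascript
// Hypothetical stand-in for main.js: fills the initially empty #app container.
function renderApp(container) {
  container.innerHTML = '<h1>Hello from JavaScript</h1>';
  return container.innerHTML;
}

// Browser-only part: skipped when no DOM is available (e.g. in Node).
if (typeof document !== 'undefined') {
  const app = document.getElementById('app');
  if (app) {
    renderApp(app);
    // The serialized DOM now includes the <h1>, although the source
    // sent by the server still has an empty <div id="app"></div>.
    console.log(document.documentElement.innerHTML);
  }
}
```

Pasting this into the console of a page that has a div with id "app" should be enough to see the two views diverge.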
As for the second one, suppose I send bad HTML from the server like this:
<head>...</head>
<!DOCTYPE html>
<html lang="en">
<body>...</body>
</html>
Then document.documentElement.innerHTML will output a tidied-up version of my bad markup, like this:
<head>...</head>
<body>...</body>
This one probably doesn't affect Google's page, but it is also worth considering when you build something on top of the document's innerHTML.
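One way to watch the browser repair markup is to run a string through DOMParser and serialize it again. This is a sketch under assumptions: reserialize and badMarkup are names invented here, the sample input is made up, and the exact output depends on the browser's parser, so none is shown.

```javascript
// Parse `html` the way a browser would and return the re-serialized markup.
// Returns null where DOMParser is unavailable (e.g. plain Node).
function reserialize(html) {
  if (typeof DOMParser === 'undefined') {
    return null;
  }
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.documentElement.innerHTML;
}

// Hypothetical malformed input: <head> arrives before the doctype.
const badMarkup =
  '<head><title>Demo</title></head><!DOCTYPE html>' +
  '<html lang="en"><body><p>Hi</p></body></html>';

// In a browser this logs the repaired markup, not the string above.
console.log(reserialize(badMarkup));
```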
So, if what you really want is the source code of the page, you probably just need to fetch it from the server directly and read the text of the response.
In client-side JS you can do so with the Fetch API. The only problem is that you might not be able to do so from an origin other than google.com, since you might run into a CORS policy problem.
On the server side, you will certainly have a tool for making a GET request, so you might use something like http.get in Node.js or file_get_contents() in PHP.
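As a rough sketch of the fetch approach: getPageSource is a hypothetical helper, and its fetchImpl parameter is an assumption added here so the function can be tried without network access. By default it relies on the global fetch, which exists in browsers and in Node 18 or later.

```javascript
// Fetch the markup exactly as the server sent it, before any script runs.
// `fetchImpl` is a hypothetical injection point; it defaults to global fetch.
async function getPageSource(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text(); // raw text, not a parsed DOM
}

// Usage in a page (same origin, so CORS does not get in the way):
// getPageSource(location.href).then((source) => {
//   console.log(source.length, document.documentElement.outerHTML.length);
// });
```

Calling it for a page on another origin from browser code will usually be blocked by CORS, as noted above; from Node there is no such restriction.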
Upvotes: 1
Reputation: 11
Google's HTML is far more complex than what you're looking for, but I assume you want something like this:
const results = document.querySelectorAll('.g');
results.forEach(function (element) {
  // Print each search result's markup, including the element itself
  console.log(element.outerHTML);
});
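If the goal is the data rather than the raw markup, the same query can be turned into a list of titles and links. This is only a sketch: summarizeResults is a name invented here, and the '.g', 'h3' and 'a[href]' selectors are assumptions based on the markup quoted in the question, which Google may change at any time.

```javascript
// Build a plain list of { title, href } objects from result containers.
// `root` is anything with querySelectorAll, normally `document`.
function summarizeResults(root, selector = '.g') {
  return Array.from(root.querySelectorAll(selector), (result) => {
    const heading = result.querySelector('h3');
    const link = result.querySelector('a[href]');
    return {
      title: heading ? heading.textContent.trim() : null,
      href: link ? link.getAttribute('href') : null,
    };
  });
}

// Browser-only usage: skipped when no DOM is available.
if (typeof document !== 'undefined') {
  console.log(summarizeResults(document));
}
```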
Upvotes: 0