PiTheNumber

Reputation: 23542

jQuery parse HTML without loading images

I load HTML from other pages to extract and display data from those pages:

$.get('http://example.org/205.html', function (html) {
    console.log( $(html).find('#c1034') );
});

That works, but because of the $(html) call my browser tries to load the images that are linked in 205.html. Those images do not exist on my domain, so I get a lot of 404 errors.

Is there a way to parse the page like $(html) but without loading the whole page into my browser?

Upvotes: 18

Views: 6911

Answers (7)

Thomas Brus

Reputation: 941

Actually, the jQuery documentation says that you can pass an "owner document" as the second argument to $.

So what we can then do is create a virtual document so that the browser does not automatically load the images present in the supplied HTML:

var ownerDocument = document.implementation.createHTMLDocument('virtual');
$(html, ownerDocument).find('.some-selector');
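
Combined with the request from the question, this could be sketched roughly as follows. The helper name findInInertDocument is hypothetical, and $ and doc are passed in as parameters only to keep the snippet free of hidden globals; in a page they would be jQuery and document.

```javascript
// Hypothetical helper: parse `html` against a detached document so that
// the markup is not built inside the page's own document.
function findInInertDocument($, doc, html, selector) {
    // A document created this way is not displayed in any window.
    var ownerDocument = doc.implementation.createHTMLDocument('virtual');
    return $(html, ownerDocument).find(selector);
}

// Usage in a page (assumes jQuery is loaded as `$`):
// $.get('http://example.org/205.html', function (html) {
//     console.log(findInInertDocument($, document, html, '#c1034'));
// });
```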

Upvotes: 18

Revadike

Reputation: 606

Instead of removing all img elements altogether, you can use the following regex to delete all src attributes:

html = html.replace(/src="[^"]*"/ig, "");
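
Note that this pattern only matches double-quoted values, and it also cuts into attributes whose names merely end in src (such as data-src). A slightly more defensive sketch, assuming attribute values are always quoted, might look like this (stripSrcAttributes is a hypothetical name):

```javascript
// Hypothetical helper: drop quoted src attributes from an HTML string.
// This is a regex heuristic, not an HTML parser.
function stripSrcAttributes(html) {
    // Require whitespace before "src" so that data-src is left alone,
    // and accept either double- or single-quoted values.
    return html.replace(/\ssrc\s*=\s*("[^"]*"|'[^']*')/gi, '');
}

console.log(stripSrcAttributes('<img src="a.png" alt="x">'));
```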

Upvotes: 0

Barak Gall

Reputation: 1560

Sorry for resuscitating an old question, but this is the first result when searching for how to stop parsed HTML from loading external assets.

I started from Nik Ahmad Zainalddin's answer, but it has a weakness: anything in between two <script> elements gets wiped out as well.

<script>
</script>
Inert text
<script>
</script>

In the above example, "Inert text" would be removed along with the script tags. I ended up doing the following instead:

html = html.replace(/<\s*(script|iframe)[^>]*>(?:[^<]*<)*?\/\1>/g, "").replace(/(<(\b(img|style|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, "");

Additionally I added the capability to remove iframes.
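
The difference can be checked on the sample above. This is only a small demonstration that applies both patterns, copied unchanged, to that one input:

```javascript
// The sample from above, as a single string.
var sample = '<script>\n</script>\nInert text\n<script>\n</script>';

// Single-pass pattern from Nik's answer (script handled with everything else).
var singlePass = sample.replace(/(<(\b(img|style|script|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, '');

// Two-pass pattern from this answer (script and iframe handled first).
var twoPass = sample
    .replace(/<\s*(script|iframe)[^>]*>(?:[^<]*<)*?\/\1>/g, '')
    .replace(/(<(\b(img|style|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, '');

console.log(JSON.stringify(singlePass)); // ""
console.log(JSON.stringify(twoPass));    // "\nInert text\n"
```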

Hope this helps someone.

Upvotes: 3

Nik

Reputation: 709

The following regex removes all occurrences of <img>, <head>, <link>, <script> and <style> elements, as well as background and style attributes, from the data string returned by the Ajax load.

html = html.replace(/(<(\b(img|style|script|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g,"");

Test regex: https://regex101.com/r/nB1oP5/1

I wish there were a better workaround (other than using a regex replace).

Upvotes: 1

fudesign2008

Reputation: 39

Parsing HTML in the following way will load images automatically:

var wrapper = document.createElement('div'),
    html = '.....';
wrapper.innerHTML = html;

If you use DOMParser to parse the HTML, the images will not be loaded automatically. See https://github.com/panzi/jQuery-Parse-HTML/blob/master/jquery.parsehtml.js for details.
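
A minimal sketch of the DOMParser approach, not taken from the linked plugin: parseWithDomParser is a hypothetical helper, and its optional second parameter exists only so that a different parser constructor can be supplied outside a browser.

```javascript
// Hypothetical helper: parse an HTML string into a separate Document.
// ParserCtor is optional and defaults to the browser's global DOMParser.
function parseWithDomParser(html, ParserCtor) {
    var Parser = ParserCtor || DOMParser;
    // 'text/html' selects the HTML parser; the result is a Document.
    return new Parser().parseFromString(html, 'text/html');
}

// Usage in a page (assumes jQuery is loaded as `$`):
// var doc = parseWithDomParser(html);
// console.log($(doc).find('#c1034'));
```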

Upvotes: 3

Johan

Reputation: 35194

You could either use jQuery's remove() method to remove the image elements

console.log( $(html).find('img').remove().end().find('#c1034') );

or remove them from the HTML string. Something like

console.log( $(html.replace(/<img[^>]*>/g,"")) );

Regarding background images, you could do something like this:

$(html).filter(function() {
    return $(this).css('background-image') !== ''; 
}).remove();

Upvotes: 1

Bhuvan Rikka

Reputation: 2703

Use a regex to remove all <img> tags:

 html = html.replace(/<img[^>]*>/g,"");
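
The pattern above is case-sensitive, so it would miss markup such as <IMG SRC="...">. A variant with the i flag could look like this sketch (stripImgTags is a hypothetical name; like the original, it assumes that no > appears inside an attribute value):

```javascript
// Hypothetical helper: remove <img> tags from an HTML string,
// regardless of letter case. A regex heuristic, not an HTML parser.
function stripImgTags(html) {
    // \b stops the match at the end of the tag name "img";
    // the i flag makes the match case-insensitive.
    return html.replace(/<img\b[^>]*>/gi, '');
}

console.log(stripImgTags('<p>a<IMG SRC="x.png">b<img src="y.png"/></p>')); // <p>ab</p>
```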

Upvotes: 17
