User1911

Reputation: 197

Python urllib get HTML page requisites

I would like to ask whether there is a proper way, with Python urllib, to retrieve (not save/download locally) all the files that are necessary to properly display a given HTML page, along with their information (page size, etc.). This includes such things as inlined images, sounds, and referenced stylesheets.

I searched and found that wget can perform the described procedure with the --page-requisites flag, but the performance is not what I need and I don't want to download anything locally. Furthermore, the -O /dev/null flag does not achieve what I want.

My final goal is to hit the page (hosted locally), gather the page info, and move on.

Any tips or reading references are appreciated.

Upvotes: 0

Views: 99

Answers (1)

AzyCrw4282

Reputation: 7744

I would recommend Scrapy. It's simple to use, and you can set an XPath to locate and retrieve just the information you need, e.g. inlined images, sounds, and referenced stylesheets.

An example that retrieves text and links:

import scrapy
from ikea.items import IkeaItem  # project-specific Item with the fields used below

class IkeaSpider(scrapy.Spider):
    name = 'ikea'

    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['www.ikea.com']

    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td'):
            item = IkeaItem()
            # use relative XPaths (.//) so each item only picks up links inside this cell
            item['name'] = sel.xpath('.//a/text()').extract()
            item['link'] = sel.xpath('.//a/@href').extract()

            yield item

As you can see, you can set an XPath to extract just what you want.

For example, to grab image and audio sources:

item['image'] = sel.xpath('.//img/@src').extract()

item['sound'] = sel.xpath('.//audio/@src').extract()
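The spider imports IkeaItem from a project-specific ikea.items module that isn't shown in the answer. A minimal sketch of that module, assuming only the fields used in the snippets above, could look like this:

import scrapy

class IkeaItem(scrapy.Item):
    # fields referenced by the spider and the snippets above
    name = scrapy.Field()
    link = scrapy.Field()
    image = scrapy.Field()
    sound = scrapy.Field()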

And as for hosting locally, it would work just the same; you would simply have to change the URL. You can then save the data or do whatever you want with it.
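To get closer to what the question asks for (gathering information such as size for each page requisite, without saving anything to disk), one approach is to yield a follow-up Request for every referenced resource and read its size from the response. A minimal sketch, assuming a hypothetical locally hosted page at http://localhost:8000/:

import scrapy

class RequisitesSpider(scrapy.Spider):
    # hypothetical spider name and local URL for illustration
    name = 'requisites'
    start_urls = ['http://localhost:8000/']

    def parse(self, response):
        # collect the URLs of images, stylesheets and audio referenced by the page
        urls = response.xpath(
            '//img/@src | //link[@rel="stylesheet"]/@href | //audio/@src'
        ).extract()
        for url in urls:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_requisite)

    def parse_requisite(self, response):
        # nothing is written to disk; only the URL and body size (in bytes) are reported
        yield {'url': response.url, 'size': len(response.body)}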

Upvotes: 1
