Reputation: 197
I would like to ask whether there is a proper way to retrieve (not save/download locally) all the files that are necessary to properly display a given HTML page, along with their information (page size etc.), using Python's urllib. This includes things such as inlined images, sounds, and referenced stylesheets.
I searched and found that wget can do something like this with its --page-requisites flag, but the performance is not the same and I don't want to download anything locally. Furthermore, the -O /dev/null flag does not work for what I want to achieve.
My final goal is to hit the page (hosted locally), gather page info, and move on.
Any tips or reading references are appreciated.
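For reference, this is roughly the direction I had in mind with urllib; the localhost URL and the set of tags I treat as requisites are just placeholders for my setup:

    from urllib.request import Request, urlopen
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class RequisiteParser(HTMLParser):
        """Collect the URLs of page requisites: images, sounds, stylesheets, scripts."""
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ('img', 'audio', 'source', 'script') and attrs.get('src'):
                self.urls.append(attrs['src'])
            elif tag == 'link' and attrs.get('rel') == 'stylesheet' and attrs.get('href'):
                self.urls.append(attrs['href'])

    base = 'http://localhost:8000/index.html'  # the locally hosted page
    html = urlopen(base).read().decode('utf-8', errors='replace')

    parser = RequisiteParser()
    parser.feed(html)

    for url in parser.urls:
        full = urljoin(base, url)
        # HEAD request: read the headers (size, type) without downloading the body
        req = Request(full, method='HEAD')
        with urlopen(req) as resp:
            print(full, resp.headers.get('Content-Length'), resp.headers.get('Content-Type'))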
Upvotes: 0
Views: 99
Reputation: 7744
I would recommend Scrapy. It's simple to use and you can set an XPath to locate and retrieve just the information you need, e.g. inlined images, sounds, and referenced stylesheets.
An example that retrieves text and links:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['ikea.com']  # domain names only, not full URLs
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td'):
            item = IkeaItem()
            # use relative XPaths so each item only picks up links inside the current cell
            item['name'] = sel.xpath('.//a/text()').extract()
            item['link'] = sel.xpath('.//a/@href').extract()
            yield item
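For completeness, IkeaItem is imported from the project's items.py; a minimal definition matching the fields used above would be:

    import scrapy

    class IkeaItem(scrapy.Item):
        # fields filled in by the spider
        name = scrapy.Field()
        link = scrapy.Field()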
As you can see, you can set an XPath to extract just what you want. For example, for images and sounds:

    item['image'] = sel.xpath('.//img/@src').extract()
    item['sound'] = sel.xpath('.//audio/@src').extract()
And as for a locally hosted page, it would work just the same; you would simply have to change the URL. You can then save the data or do whatever you want with it.
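Applied to your case, a rough sketch of a spider that lists the page requisites and their sizes could look like this (the localhost start URL and the choice of tags are assumptions; nothing is written to disk unless you attach a pipeline or feed export):

    import scrapy

    class RequisitesSpider(scrapy.Spider):
        name = 'requisites'
        start_urls = ['http://localhost:8000/']  # the locally hosted page

        def parse(self, response):
            # inlined images, sounds, referenced stylesheets and scripts
            urls = (response.xpath('//img/@src').extract()
                    + response.xpath('//audio/@src').extract()
                    + response.xpath('//link[@rel="stylesheet"]/@href').extract()
                    + response.xpath('//script/@src').extract())
            for url in urls:
                # fetch each requisite only to report its info
                yield response.follow(url, callback=self.parse_requisite)

        def parse_requisite(self, response):
            # the body only lives in memory; report the size in bytes
            yield {'url': response.url, 'size': len(response.body)}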
Upvotes: 1