Reputation: 186
I want to extract image URLs from a particular website using Scrapy in Python. The page has the following HTML structure:
<div class="comic-table">
    <div id="comic">
        <img src="" alt="" title="">
        <img src="" alt="" title="">
    </div>
</div>
Here is the Scrapy code I have written:
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from Pencils.items import PencilsItem

class Spider(CrawlSpider):
    name = 'pencil'
    allowed_domains = ['']
    start_urls = ['']
    rules = [Rule(LinkExtractor(allow=['/uploads/.*']), 'parse_pencil')]

    def parse_pencil(self, response):
        image = PencilsItem()
        rel = response.xpath("WHAT_SHOULD_I_PUT_HERE").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image
What should I put in the response.xpath() call?
P.S. I'm a beginner in HTML and Python.
Upvotes: 1
Views: 1571
Reputation: 48649
Try this:
//div[@id="comic"]/img
// => search the whole HTML page
@ => attribute
That xpath looks for all <div> tags which have an attribute named id which is equal to "comic" (there should only be one <div> tag with the attribute id="comic" because an id should be unique), and extracts the <img> tags therein.
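To see the expression at work without running a crawl, here is a minimal sketch using only Python's standard library. It assumes a well-formed copy of the page structure from the question, and the src values are made up for illustration. ElementTree understands only a subset of XPath, so the path is written relative to the root with a leading "." and the attribute is read from each element rather than selected with /@src.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the page structure from the question.
# The src values are invented for illustration.
PAGE = """
<div class="comic-table">
  <div id="comic">
    <img src="//example.com/uploads/a.png" alt="" title=""/>
    <img src="//example.com/uploads/b.png" alt="" title=""/>
  </div>
  <div id="other">
    <img src="//example.com/uploads/ignored.png" alt="" title=""/>
  </div>
</div>
""".strip()

root = ET.fromstring(PAGE)

# ".//" searches all descendants of the root, [@id='comic'] filters on the
# id attribute, and "/img" selects the <img> children of that <div>.
images = root.findall(".//div[@id='comic']/img")

# ElementTree cannot select attributes with "/@src", so read src per element.
srcs = [img.get("src") for img in images]
print(srcs)
```

Only the two images inside the div with id="comic" are returned; the image in the other div is skipped.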
With Scrapy, you can do something like the following to get all the <img> tags and print their src attributes:
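import scrapy

class TestSpider(scrapy.Spider):
    name = "my_spider"
    # Local test file, as shown in the crawl log below
    start_urls = [
        "file:///Users/7stud/python_programs/scrapy_stuff/html_files/html.html",
    ]

    def parse(self, response):
        # Select every <img> directly inside <div id="comic">
        for selector in response.xpath('//div[@id="comic"]/img'):
            src = selector.xpath('@src').extract()
            print(src[0])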
(scrapy_env)~/python_programs/scrapy_stuff$ scrapy crawl my_spider
2016-03-29 02:19:09 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapy_stuff)
2016-03-29 02:19:09 [scrapy] INFO: Optional features available: ssl, http11
2016-03-29 02:19:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_stuff.spiders', 'SPIDER_MODULES': ['scrapy_stuff.spiders'], 'BOT_NAME': 'scrapy_stuff'}
2016-03-29 02:19:09 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-29 02:19:09 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-29 02:19:09 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-29 02:19:09 [scrapy] INFO: Enabled item pipelines:
2016-03-29 02:19:09 [scrapy] INFO: Spider opened
2016-03-29 02:19:09 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 02:19:09 [scrapy] DEBUG: Telnet console listening on
2016-03-29 02:19:09 [scrapy] DEBUG: Crawled (200) <GET file:///Users/7stud/python_programs/scrapy_stuff/html_files/html.html> (referer: None)
2016-03-29 02:19:09 [scrapy] INFO: Closing spider (finished)
2016-03-29 02:19:09 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 263,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 243,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 29, 8, 19, 9, 251971),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 3, 29, 8, 19, 9, 139531)}
2016-03-29 02:19:09 [scrapy] INFO: Spider closed (finished)
And in fact, if all you want is the src attribute from the <img> tags, you can get the src attributes directly using the following xpath:
//div[@id="comic"]/img/@src
def parse(self, response):
    for selector in response.xpath('//div[@id="comic"]/img/@src'):
        print(selector.extract())
2016-03-29 02:33:56 [scrapy] DEBUG: Crawled (200) <GET file:///Users/7stud/python_programs/scrapy_stuff/html_files/html.html> (referer: None)
2016-03-29 02:33:57 [scrapy] INFO: Closing spider (finished)
Regarding your note that you are a beginner in HTML and Python: what about XML and XPath? The subject you really need to explore is XPath. However, as a beginner in HTML and XPath, I would suggest you start with BeautifulSoup for scraping web pages.
Upvotes: 2
Reputation: 5191
In order to get all the links, you should use the XPath //div[@id='comic']/img/@src, and your code will look like this:
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from stackoverflow.items import PencilsItem

class Spider(CrawlSpider):
    name = 'pencil'
    allowed_domains = ['']
    start_urls = ['']
    rules = [Rule(LinkExtractor(allow=['/uploads/.*']), 'parse_pencil')]

    def parse_pencil(self, response):
        item = PencilsItem()
        item['image_urls'] = response.xpath("//div[@id='comic']/img/@src").extract()
        yield item
Use this code if the img src doesn't contain the domain:
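try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
parsed_uri = urlparse(response.url)
domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
# lstrip('/') avoids a double slash when the src starts with "/"
links = [domain + link.lstrip('/') for link in response.xpath("//div[@id='comic']/img/@src").extract()]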
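Prepending the domain by hand only covers src values that are paths on the same host. As an alternative sketch, the standard library's urljoin resolves protocol-relative, root-relative, and path-relative values against the page URL in one step. The helper name absolutize, the page URL, and the src values below are hypothetical and only for illustration.

```python
from urllib.parse import urljoin  # Python 3; on Python 2 urljoin lives in the urlparse module


def absolutize(page_url, srcs):
    """Resolve each src value against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in srcs]


# Hypothetical page URL and src values, for illustration only.
page_url = "http://example.com/comics/1"
srcs = [
    "//example.com/uploads/a.png",    # protocol-relative
    "/uploads/b.png",                 # root-relative
    "c.png",                          # relative to the current path
    "https://cdn.example.org/d.png",  # already absolute, left unchanged
]

resolved = absolutize(page_url, srcs)
for url in resolved:
    print(url)
```

Inside a Scrapy callback, response.urljoin(src) performs the same kind of resolution against the URL of the response, so the list comprehension above could call that instead of building the domain string.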
Upvotes: 0