Reputation: 159
Below is a sample piece of HTML code that I want to scrape with scrapy.
<body>
<h2 class="post-title entry-title">Sample Header</h2>
<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div>
<ul class="post-categories">
<li><a href="123.html">Category1</a></li>
<li><a href="456.html">Category2</a></li>
<li><a href="789.html">Category3</a></li>
</ul>
</body>
Right now I am using the below working scrapy code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//h2[@class="post-title entry-title"]/text()').extract()[0]
        item['tag'] = hxs.select('//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='entry clearfix']").extract()[0]
        return item
It gives me the following xml output:
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<article_html>
<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div>
</article_html>
<tag>
Category1
</tag>
<title>
Sample Header
</title>
</item>
</items>
I want to know how to achieve the following output:
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<article_html>
<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<!--end comment-->
</div>
</article_html>
<tag>
Category1,Category2,Category3
</tag>
<title>
Sample Header
</title>
</item>
</items>
Note: The number of categories depends on the post. The example above has 3 categories, but there could be more or fewer.
Help would be much appreciated. Cheers.
Upvotes: 3
Views: 1929
Reputation: 4118
That's not a job for scrapy's selectors. If you want to manipulate a DOM, you are better off with BeautifulSoup:
from bs4 import BeautifulSoup
[...]
item = hxs.select("//div[@class='entry clearfix']").extract()[0]
# Remove div.sample1
soup = BeautifulSoup(item, 'html.parser')
# select() takes a CSS selector and returns a list of matching tags
tag = soup.select('div.sample1')[0]  # assumes at least one element is found
tag.extract()  # remove the element in place
sub_total = soup.div.encode()
The second one is a little trickier, since I couldn't get it working with extract(), but this is one alternative:
soup = BeautifulSoup(item, 'html.parser')
temp = soup.div.div.extract()  # pull out the inner tag of interest
soup.div.clear()  # remove all remaining children (comments included)
soup.div.append(temp)  # put the saved tag back
sub_total = soup.div.encode()
I know this feels too manual and verbose, but at least it is one alternative.
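If the goal is the exact output from the question (drop `div.sample2` but keep both comments), a minimal sketch along the same lines could call `decompose()` on the matched tags instead of clearing the whole parent. The helper name `drop_elements` and the `ARTICLE_HTML` constant are made up for illustration, and `html.parser` is just one possible parser choice:

```python
from bs4 import BeautifulSoup

# Stand-in for the string returned by .extract()[0] in the spider.
ARTICLE_HTML = """<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div>"""


def drop_elements(html, css_selector):
    """Return html with every element matching css_selector removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.select(css_selector):
        tag.decompose()  # removes the tag and its children; siblings are untouched
    return str(soup)


cleaned = drop_elements(ARTICLE_HTML, "div.sample2")
print(cleaned)
```

In the spider, the returned string could then be assigned to `item['article_html']`. Since only the matched tags are removed, the surrounding comments stay where they were.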
Upvotes: 0
Reputation: 18799
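from scrapy.selector import Selector

# Triple quotes so the HTML can span lines and contain double quotes
sel = Selector(text=u"""<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div>""")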
total = sel.xpath('//div[@class="entry clearfix"]').extract_first()
First Part
unwanted_part = sel.xpath('//div[@class="sample2"]').extract_first()
new_total = total.replace(unwanted_part, '')
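One thing to keep in mind with this approach: `str.replace` removes every identical occurrence of the fragment, not only the node that was selected. If that matters, the optional count argument limits the replacement to the first match in the string. A small sketch with made-up strings (here `total` and `unwanted_part` only stand in for the extracted values above):

```python
# Made-up markup in which the same fragment appears twice.
total = "<div><p>Hi</p><p>Hi</p></div>"
unwanted_part = "<p>Hi</p>"

# Default behaviour: every copy of the fragment is removed.
everywhere = total.replace(unwanted_part, "")

# With count=1 only the first copy in the string is removed.
first_only = total.replace(unwanted_part, "", 1)

print(everywhere)
print(first_only)
```

Note that "first match in the string" is not necessarily the node the XPath selected, so for repeated markup a DOM-based removal is probably the safer route.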
Second Part
comments = sel.xpath('//div[@class="entry clearfix"]/comment()').extract()
new_total = total
for comment in comments:
    new_total = new_total.replace(comment, '')
I don't think this can be done just using xpath or css.
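The category list from the question is a different story. One way that should work is to drop the `[1]` index from the XPath so that every `li` matches, then join the extracted strings in Python. A rough sketch, assuming the input is the list of strings that `.extract()` returns; `join_categories` is a hypothetical helper name and the hard-coded list stands in for the selector result:

```python
def join_categories(names, sep=","):
    """Join extracted category names, skipping empty or whitespace-only entries."""
    cleaned = [name.strip() for name in names]
    return sep.join(name for name in cleaned if name)


# In the spider this list would come from a query without the [1] index, e.g.
#   hxs.select('//ul[@class="post-categories"]/li/a/text()').extract()
# The literal below stands in for that result.
extracted = ["Category1", " Category2 ", "Category3"]
tag_value = join_categories(extracted)
print(tag_value)  # Category1,Category2,Category3
```

Because the join works on whatever list comes back, it handles posts with one category or many without any special casing.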
Upvotes: 2