Reputation: 159
I have this HTML:
<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>
I want to get all the HTML in <div id="content"> in Scrapy, but excluding the <div class="infobox"> block, so the expected result looks like this:
<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
</div>
How can I modify my current selector to achieve this?
item['article_html'] = hxs.select("//div[@id='content']").extract()[0]
Upvotes: 2
Views: 3045
Reputation: 16910
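You can also use the CSS selector :not(...) to exclude things.
Although completely untested, try something like this (note that :not(...) filters the elements the selector matches, so as written it tests the class of the content div itself rather than removing a nested block):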
response.css("div[id='content']:not([class*='infobox'])")
Upvotes: 1
Reputation: 4228
You can manipulate the element tree to remove the offending div:
>>> from scrapy.http import HtmlResponse
>>> body = '''\
... <div id="content">
... <h1>Title 1</h1><br><br>
...
... <h2>Sub-Title 1</h2>
... <br><br>
... Description 1.<br><br>Description 2.
... <br><br>
...
... <h2>Sub-Title 2</h2>
... <br><br>
... Description 1<br>Description 2<br>
... <br><br>
...
... <div class="infobox">
... <font style="color:#000000"><b>Information Title</b></font>
... <br><br>Long Information Text
... </div>
... </div>
... '''
>>> resp = HtmlResponse(url='http://example.com', body=body, encoding='utf8')
>>> xhs = resp.selector
>>> infobox = xhs.css('.infobox')[0].root
>>> infobox.getparent().remove(infobox)
>>> print(xhs.xpath("//div[@id='content']").extract()[0])
<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
</div>
Upvotes: 1
Reputation: 18799
A selector (XPath) on its own cannot return an element's HTML with one of its descendants left out.
Instead, you could extract both pieces and strip one from the other:
content = hxs.select("//div[@id='content']").extract()[0]
infobox = hxs.select("//div[@id='content']//div[@class='infobox']").extract()[0]
item['article_html'] = content.replace(infobox, "")
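The snippet above raises an IndexError on pages that have no infobox, because extract()[0] indexes an empty list, and it only handles the first infobox. Here is a minimal sketch of a more tolerant variant; strip_fragments is a hypothetical helper, and the hard-coded strings stand in for what extract() would return:

```python
def strip_fragments(content, fragments):
    """Return content with every string in fragments removed."""
    for fragment in fragments:
        if fragment:  # ignore empty fragments
            content = content.replace(fragment, "")
    return content


# Stand-ins for the extracted container and the extracted infobox blocks.
content = (
    '<div id="content"><h1>Title 1</h1>'
    '<div class="infobox">Long Information Text</div></div>'
)
infoboxes = ['<div class="infobox">Long Information Text</div>']

article_html = strip_fragments(content, infoboxes)
print(article_html)
```

In the spider, the second argument would be the full list from hxs.select("//div[@id='content']//div[@class='infobox']").extract(), so an empty list simply leaves the content unchanged.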
Upvotes: 2