Reputation: 10570
I have this HTML:
<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E
RERA ID: 12087
Specialist Licensed Property Brokers & Consultants
Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>
I want to get all the text inside the td, so I tried
normalize-space(td/text())
but I got only the last line.
What should I do to get all the lines?
Upvotes: 0
Views: 2117
Reputation: 20748
You can use u"".join(selector.xpath('.//td//text()').extract())
or u"".join(selector.css('td ::text').extract())
I almost forgot the simplest way: if you want all the text content of a specific node, you can call normalize-space()
on it directly:
paul@wheezy:~$ ipython
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: from scrapy.selector import Selector
In [2]: selector = Selector(text="""<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E
...:
...: RERA ID: 12087
...:
...: Specialist Licensed Property Brokers & Consultants
...: Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>""", type="html")
In [3]: selector.xpath("normalize-space(.//td)")
Out[3]: [<Selector xpath='normalize-space(.//td)' data=u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID'>]
In [4]: selector.xpath("normalize-space(.//td)").extract()
Out[4]: [u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial \u2013 Buying, Selling, R ...Read more...']
In [5]: [td.xpath("normalize-space(.)").extract() for td in selector.css("td")]
Out[5]: [[u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial \u2013 Buying, Selling, R ...Read more...']]
Remember that normalize-space()
converts its argument to a string, and the string value of a node-set is that of its first node in document order only. It therefore does what you want only when you are sure the argument matches exactly one node, the one you are after.
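To make the "first node only" point concrete, here is a minimal sketch that uses only Python's standard library rather than Scrapy. It is an illustration under assumptions, not what Scrapy does internally: SNIPPET is a shortened, well-formed stand-in for the markup in the question (the "&" is escaped so an XML parser accepts it), and the normalize_space helper is a hypothetical approximation of the XPath function.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened, XML-well-formed stand-in for the question's markup.
SNIPPET = """<td width="70%">REGEN REAL ESTATE, Dubai
RERA ID: 12087
Specialist Licensed Property Brokers &amp; Consultants
Residential / Commercial <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>"""


def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs.

    Hypothetical helper: str.split() treats more characters as whitespace
    than XPath does, which makes no difference for this input.
    """
    return " ".join(text.split())


td = ET.fromstring(SNIPPET)

# String value of the element itself: all descendant text nodes concatenated.
# This corresponds to normalize-space(.) evaluated on the td.
all_text = normalize_space("".join(td.itertext()))

# Only the first text node of the td, which is all that a node-set such as
# td/text() contributes once it is converted to a string.
first_text = normalize_space(td.text or "")

print(all_text)
print(first_text)
```

The first value includes the link text, while the second stops where the a element begins, which is why passing the element itself is usually the safer choice.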
Upvotes: 1
Reputation: 473863
normalize-space(//td/text())
works for me.
Demo (using xmllint):
$ xmllint input.xml --xpath "normalize-space(//td/text())"
REGEN REAL ESTATE, Dubai – U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial – Buying, Selling, R
where input.xml
contains the markup you provided.
Upvotes: 1