Reputation: 4893
I'm trying to build a parser that scrapes data from a table containing information about drugs, such as drug name, form and price. The problem is that some values are missing, so when I scrape the data the order gets disrupted. Please take a look below to better understand the problem.
Table form:
+---------+----------+-------+
| name | form | price |
+---------+----------+-------+
| aspirin | 3 pills | 1 |
| aspirin | 5 pills | |
| aspirin | 10 pills | 3 |
+---------+----------+-------+
Every price field is an HTML link, so the HTML of this table looks like this:
<table>
<tr>
<td>name</td>
<td>form</td>
<td>price</td>
</tr>
<tr>
<td>aspirin</td>
<td>3 pills</td>
<td><a href="http://x.html">1</a></td>
</tr>
<tr>
<td>aspirin</td>
<td>5 pills</td>
<td></td>
</tr>
<tr>
<td>aspirin</td>
<td>10 pills</td>
<td><a href="http://x.html">3</a></td>
</tr>
</table>
What's the best way to extract the price fields from this table, INCLUDING the empty ones, so that the returned item has this form: ['1', '', '3']?
When I use the xpath "//table/tr/td[3]/a/text()", the empty fields are omitted and I get this: ['1', '3'].
I was thinking about extracting the whole cells with the xpath "//table/tr/td[3]" and then transforming them in the pipeline. However, I hope there is a simpler solution, because the data I crawl from the original website is a bit more complicated, and as a result I get this:
[u'<td>\r\n\t\t\t\t</td>',
u'<td>\r\n\t\t \r\n \t\t\t\t\t<a class="tooltip-lek" href="#" rel="#tooltip169815" title="Odp\u0142atno\u015b\u0107 po refundacji">3.20</a>\xa0\xa0\xa0\r\n\t\t\t<div style="display:none;" id="tooltip169815">\r\n\t\t\t\t<table>\r\n\t\t\t\t<tbody>\r\n\t\t\t\t\r\n\t\t\t\t<tr>\r\n\t\t\t\t<td style="padding-right:5px;">lek wydawany za odp\u0142atno\u015bci\u0105 rycza\u0142tow\u0105 (3,20 z\u0142) do wysoko\u015bci limitu:</td>\r\n\t\t\t\t<td>we wskazaniach: choroba afektywna dwubiegunowa, schizofrenia</td>\r\n\t\t\t\t</tr>\r\n\t\t\t\t\r\n\t\t\t\t</tbody>\r\n\t\t\t\t</table>\r\n\t\t\t</div>\r\n\t\t\t\t\t\t\t</td>',
u'<td>\r\n\t\t\t\t</td>']
Upvotes: 0
Views: 362
Reputation: 20748
You could do something like this:
[u''.join(third_cell.xpath('./a/text()|./text()').extract()).strip()
for third_cell in selector.xpath('//table/tr[position()>1]/td[3]')]
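i.e. loop over the 3rd cell of each table row (starting from row 2, which skips the header row) and join all text nodes of that cell into a single string. Because you get one string per cell, an empty cell yields an empty string instead of being dropped.
You should get [u'1', u'', u'3']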
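To try the idea without a Scrapy project, here is a minimal sketch that applies the same per-cell approach to the sample table from the question. It uses the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector, which only works because the sample markup happens to be well-formed; for real-world HTML the selector-based expression above is the better fit. The names HTML, cell_text and extract_prices are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample table from the question (well-formed, so an XML parser can read it).
HTML = """
<table>
  <tr><td>name</td><td>form</td><td>price</td></tr>
  <tr><td>aspirin</td><td>3 pills</td><td><a href="http://x.html">1</a></td></tr>
  <tr><td>aspirin</td><td>5 pills</td><td></td></tr>
  <tr><td>aspirin</td><td>10 pills</td><td><a href="http://x.html">3</a></td></tr>
</table>
"""


def cell_text(cell):
    """Join the cell's own text and the text of its direct <a> children.

    This roughly mirrors the XPath './a/text()|./text()': text inside other
    child elements (e.g. a hidden tooltip <div>) is deliberately ignored.
    """
    parts = [cell.text or ""]
    for child in cell:
        if child.tag == "a":
            parts.append(child.text or "")
        # Text that follows a child element still belongs to the cell.
        parts.append(child.tail or "")
    return "".join(parts).strip()


def extract_prices(markup):
    """Return one string per data row, keeping empty cells as ''."""
    table = ET.fromstring(markup)
    prices = []
    for row in table.findall("tr")[1:]:  # skip the header row
        cell = row.find("td[3]")         # third cell of the row (1-based)
        prices.append(cell_text(cell) if cell is not None else "")
    return prices


prices = extract_prices(HTML)
print(prices)  # ['1', '', '3']
```

The key design choice is the same as in the one-liner: iterate over cells first and extract text second, so the result always has exactly one entry per row.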
Upvotes: 1