Scraping empty fields from table

Question

I'm trying to build a parser that scrapes data from a table containing information about drugs, such as drug name, form and price. The problem is that some values are missing, so when I scrape the data the order gets disrupted. Please take a look below to better understand the problem.

Table form:

+---------+----------+-------+
|   name  |   form   | price |
+---------+----------+-------+
| aspirin | 3 pills  |   1   |
| aspirin | 5 pills  |       |
| aspirin | 10 pills |   3   |
+---------+----------+-------+

Every price field is an HTML link, so the HTML of this table looks like this:



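<table>
  <tr>
    <td>name</td>
    <td>form</td>
    <td>price</td>
  </tr>
  <tr>
    <td>aspirin</td>
    <td>3 pills</td>
    <td><a href="...">1</a></td>
  </tr>
  <tr>
    <td>aspirin</td>
    <td>5 pills</td>
    <td></td>
  </tr>
  <tr>
    <td>aspirin</td>
    <td>10 pills</td>
    <td><a href="...">3</a></td>
  </tr>
</table>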

What's the best way to extract the price fields from this table, including the empty ones, so that the returned item looks like this: ['1', '', '3']?

With the XPath "//table/tr/td[3]/a/text()" the empty fields are omitted and I get this: ['1', '3'].

I was thinking about crawling the data with this XPath: "//table/tr/td[3]/" and then transforming it in the pipeline. However, I hope there is a simpler solution, because the data I crawl from the original website is a bit more complicated, and as a result I get this:

[u'
				',
 u'
		      
      					3.20\xa0\xa0\xa0
			
				
				
				
				
				lek wydawany za odp\u0142atno\u015bci\u0105 rycza\u0142tow\u0105 (3,20 z\u0142) do wysoko\u015bci limitu:
				we wskazaniach:                 choroba afektywna dwubiegunowa, schizofrenia
				
				
				
				
			
							',
 u'
				']

paul trmbrth · Accepted Answer

You could do something like this:

[u''.join(third_cell.xpath('./a/text()|./text()').extract()).strip()
 for third_cell in selector.xpath('//table/tr[position()>1]/td[3]')]

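i.e. loop over the 3rd cell of each table row (starting from row 2, to skip the header), and join all text nodes of that cell into a single string. Because the outer XPath selects the cells themselves rather than their text nodes, an empty cell still contributes one (empty) string to the list.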

You should get [u'1', u'', u'3']
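
If you want to try the idea without a Scrapy project, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree for the Scrapy selector. The sample markup, the placeholder href values and the helper names cell_text and third_column are assumptions made for illustration, and ElementTree only accepts well-formed markup, so treat it as a demonstration of the "select cells, then join text per cell" approach rather than a drop-in replacement. It also collapses whitespace, which may help with the messier cells shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup mirroring the table in the question.
# The href values are placeholders; the last price is padded with
# whitespace and non-breaking spaces to imitate messy real-world cells.
HTML = """
<table>
  <tr><td>name</td><td>form</td><td>price</td></tr>
  <tr><td>aspirin</td><td>3 pills</td><td><a href="#">1</a></td></tr>
  <tr><td>aspirin</td><td>5 pills</td><td></td></tr>
  <tr><td>aspirin</td><td>10 pills</td><td><a href="#">
      3\u00a0\u00a0
  </a></td></tr>
</table>
"""


def cell_text(cell):
    """Return all text inside a cell as one whitespace-normalized string."""
    if cell is None:
        # Row has fewer than three cells: still emit a placeholder.
        return ""
    # itertext() yields every text node under the cell, including text
    # nested inside <a>; an empty cell yields nothing, giving "".
    raw = "".join(cell.itertext())
    # str.split() with no arguments splits on any Unicode whitespace,
    # non-breaking spaces included, so this collapses and trims it.
    return " ".join(raw.split())


def third_column(markup):
    """Return the text of the 3rd cell of every row except the header."""
    table = ET.fromstring(markup.strip())
    rows = table.findall("tr")[1:]  # skip the header row
    # Select the cell itself (not its text) so empty cells keep their slot.
    return [cell_text(row.find("td[3]")) for row in rows]


print(third_column(HTML))  # ['1', '', '3']
```

The key design choice is the same as in the Scrapy one-liner above: iterate over cells so that there is exactly one result per row, and only then reduce each cell to text.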
