Reputation: 1393
So I'm trying to extract a table from a website. It's a two column table as follows:
Name Foo
Number Foo123
Address 10
First Drive
London
AB34 5FG
Region United Kingdom
The table doesn't have headers and the "Address" row contains blank cells in the first column for second, city, postcode, etc.
I've managed to get the table, just fine.
table = response.xpath('//table[@id="MemberDetails"]/tr/td//text()')
This is the output:
[<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Name:\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFoo\xa0\r\n'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Number:\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFoo123\xa0\r\n'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Address:\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\n(10)\xa0\r\n'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFirst Drive\xa0\r\n'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nLondon\xa0\r\n'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nAB34 5FG\xa0\r\n'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nUnited Kingdom\xa0\r\n'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Region:\xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nUnited Kingdom\xa0\r\n'>]
However, I'm stumped as to how I can parse the table into a proper structure.
1st Question: Not sure how I can deal with the address field. 2nd Question: This is a two column table. When saving this, I'd like to transpose such that, the "Name, Number, Address, Region" are column headings.
There are 1000's of pages like this that contain similar data.
Appreciate if someone can point me in the right direction.
Upvotes: 0
Views: 630
Reputation: 10666
You can generate a dictionary for all rows in your table:
def parse(self, response):
table_data = {}
current_key = None
for tr in response.xpath('//table[@id="MemberDetails"]//tr'):
key = tr.xpath('string(./td[1])').extract_first()
value = tr.xpath('string(./td[2])').extract_first()
if key:
key = key.strip()
key = key.replace(":", "")
if value:
value = value.strip()
if key:
current_key = key
if current_key in table_data:
table_data[current_key] += '\n' + value
else:
table_data[current_key] = value
print(table_data["Address"])
Upvotes: 1
Reputation: 1485
Let's work with the sample you have provided us with. (Working with this link )
Let's look at how a table line is formatted.
<tr>
<td bgcolor="#7EADAD">
<strong>Membership Name: </strong>
</td>
<td>
MESSRS R J & L A ACTON
</td>
</tr>
This is great for us, we have different attributes for headers and values: headers are within the strong
tags within td
s with bgcolor
values are directly inside their td
s.
Let's check out how the empty lines for the address info look:
<tr>
<td bgcolor="#7EADAD">
<strong> </strong>
</td>
<td>
NORTHUMBERLAND
</td>
</tr>
Great, same structure.
This means we can loop through all tr
s and grab their data depending on the attributes if we need to.
Here's a minimal example that does not deal with links:
for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
header = tr.xpath('td/strong/text()').extract()[0].strip()
value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
print(u'{} -- {}'.format(header, value))
You next step is to collect the data (make a dictionary? separate lists? write directly to files? It's your choice.).
For dealing with the address: you could make an address variable in the loop for tr
s that is concatenated with the value if the header is empty (assuming only the address headers are empty). Something like this (again< minimal example):
for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
address = ''
header = tr.xpath('td/strong/text()').extract()[0].strip()
value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
if not header: # empty strings evaluate to False
address += '' + value
You will also have to work out how to deal with headers/values that are stored within a
tags.
Upvotes: 1
Reputation: 947
You can do something like this:
data = {}
rows = response.css('table#MemberDetails tr')
for row in rows:
label = row.css('td:nth-child(1) strong::text').extract_first().strip()
value = row.css('td+td::text').extract_first().strip()
if label:
label = label.replace(':', '')
data[label] = value
else:
data['Address'] = data['Address'] + ', ' + value
print(data)
It does not work on every situation (for example, in your link Herd Completeness of Performance Rating:
label is in a <a>
tag and the value is an image), but you have a beginning of solution :)
Other solution:
name = response.css('td:contains("Membership Name:") + td::text').extract_first().strip()
...
address = response.css('td:contains("Address:") + td::text').extract_first().strip()
second = response.css('tr:contains("Address:") + tr td + td::text').extract_first().strip()
city = response.css('tr:contains("Address:") + tr + tr td + td::text').extract_first().strip()
Upvotes: 1