Kvothe
Kvothe

Reputation: 1393

Scrapy extracting rows from table without headers

So I'm trying to extract a table from a website. It's a two column table as follows:

Name      Foo
Number    Foo123
Address   10
          First Drive
          London
          AB34 5FG
Region    United Kingdom

The table doesn't have headers and the "Address" row contains blank cells in the first column for second, city, postcode, etc.

I've managed to get the table, just fine.

table = response.xpath('//table[@id="MemberDetails"]/tr/td//text()')

This is the output:

[<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Name:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFoo\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Number:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFoo123\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Address:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\n(10)\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFirst Drive\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nLondon\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nAB34 5FG\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nUnited Kingdom\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Region:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nUnited Kingdom\xa0\r\n'>]

However, I'm stumped as to how I can parse the table into a proper structure.

1st Question: Not sure how I can deal with the address field. 2nd Question: This is a two column table. When saving this, I'd like to transpose such that, the "Name, Number, Address, Region" are column headings.

There are 1000's of pages like this that contain similar data.

Appreciate if someone can point me in the right direction.

Upvotes: 0

Views: 630

Answers (3)

gangabass
gangabass

Reputation: 10666

You can generate a dictionary for all rows in your table:

def parse(self, response):

    table_data = {}
    current_key = None

    for tr in response.xpath('//table[@id="MemberDetails"]//tr'):

        key = tr.xpath('string(./td[1])').extract_first()
        value = tr.xpath('string(./td[2])').extract_first()

        if key:
            key = key.strip()
            key = key.replace(":", "")
        if value:
            value = value.strip()

        if key:
            current_key = key

        if current_key in table_data:
            table_data[current_key] += '\n' + value
        else:
            table_data[current_key] = value

    print(table_data["Address"])

Upvotes: 1

Chillie
Chillie

Reputation: 1485

Let's work with the sample you have provided us with. (Working with this link )

Let's look at how a table line is formatted.

<tr>
<td bgcolor="#7EADAD">
<strong>Membership Name:&nbsp;</strong>
</td>
<td>
MESSRS R J &amp; L A ACTON&nbsp;
</td>
</tr>

This is great for us, we have different attributes for headers and values: headers are within the strong tags within tds with bgcolor values are directly inside their tds.

Let's check out how the empty lines for the address info look:

<tr>
<td bgcolor="#7EADAD">
<strong>&nbsp;</strong>
</td>
<td>
NORTHUMBERLAND&nbsp;
</td>
</tr>

Great, same structure.

This means we can loop through all trs and grab their data depending on the attributes if we need to. Here's a minimal example that does not deal with links:

for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
    header = tr.xpath('td/strong/text()').extract()[0].strip()
    value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
    print(u'{} -- {}'.format(header, value))

You next step is to collect the data (make a dictionary? separate lists? write directly to files? It's your choice.).

For dealing with the address: you could make an address variable in the loop for trs that is concatenated with the value if the header is empty (assuming only the address headers are empty). Something like this (again< minimal example):

for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
    address = ''
    header = tr.xpath('td/strong/text()').extract()[0].strip()
    value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
    if not header: # empty strings evaluate to False
        address += '' + value

You will also have to work out how to deal with headers/values that are stored within a tags.

Upvotes: 1

Sewake
Sewake

Reputation: 947

You can do something like this:

data = {}
rows = response.css('table#MemberDetails tr')
for row in rows:
   label = row.css('td:nth-child(1) strong::text').extract_first().strip()
   value = row.css('td+td::text').extract_first().strip()
   if label:
       label = label.replace(':', '')
       data[label] = value
   else:
       data['Address'] = data['Address'] + ', ' + value
print(data)

It does not work on every situation (for example, in your link Herd Completeness of Performance Rating: label is in a <a> tag and the value is an image), but you have a beginning of solution :)

Other solution:

name = response.css('td:contains("Membership Name:") + td::text').extract_first().strip()
...
address = response.css('td:contains("Address:") + td::text').extract_first().strip()
second = response.css('tr:contains("Address:") + tr td + td::text').extract_first().strip()
city = response.css('tr:contains("Address:") + tr + tr td + td::text').extract_first().strip()

Upvotes: 1

Related Questions