RomanM

Reputation: 29

Scrapy.org - Tables

I am trying to scrape all the tables from this page: "https://www.zoznamspravcov.sk/cake_administrator/publishedAdministrators/view/1".

My lame code so far:

import scrapy

class MindopSpider(scrapy.Spider):
    name = 'MD'
    start_urls = ['https://www.zoznamspravcov.sk/cake_administrator/publishedAdministrators/view/1/']

    def parse(self, response):
        # Walk every table on the page.
        for table in response.css('table'):
            head = table.css('h2::text').extract()
            rows = []
            for tr in table.css('tr'):
                # Query the current row, not the whole table.
                rows.append(tr.css('td::text').extract())
            yield {'Head': head, 'Rows': rows}

In the end, I would like the tables with their names and data. Could anybody help me? Many thanks :)
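For reference, the target shape (one record per table, holding its name and its rows) can be sketched without Scrapy, using only Python's standard html.parser. This is a minimal sketch, not a description of the real page: the sample markup, the TableCollector class and the extract_tables helper are all hypothetical, and it assumes each table is named by the closest preceding h2.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration; the real page may be structured differently.
SAMPLE_HTML = """
<div class="administrators view">
  <h2>Administrator</h2>
  <table>
    <tr><td>Name</td><td>Example Person</td></tr>
    <tr><td>City</td><td>Bratislava</td></tr>
  </table>
  <h2>Office</h2>
  <table>
    <tr><td>Street</td><td>Example Street 1</td></tr>
  </table>
</div>
"""


class TableCollector(HTMLParser):
    """Collect one record per <table>, named after the closest preceding <h2>."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished records: {'Head': str, 'Rows': [[str, ...], ...]}
        self._head = ''     # text of the most recent <h2>
        self._rows = None   # rows of the table being read, None outside a table
        self._row = None    # cells of the row being read, None outside a row
        self._text = None   # text buffer for the current <h2>, <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self._rows = []
        elif tag == 'tr' and self._rows is not None:
            self._row = []
        elif tag in ('h2', 'td', 'th'):
            self._text = []

    def handle_data(self, data):
        # Keep text only while inside a heading or a cell.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2' and self._text is not None:
            self._head = ''.join(self._text).strip()
            self._text = None
        elif tag in ('td', 'th') and self._text is not None:
            if self._row is not None:
                self._row.append(''.join(self._text).strip())
            self._text = None
        elif tag == 'tr' and self._row is not None and self._rows is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == 'table' and self._rows is not None:
            self.tables.append({'Head': self._head, 'Rows': self._rows})
            self._rows = None


def extract_tables(html):
    """Return a list of {'Head': ..., 'Rows': ...} records, one per table."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == '__main__':
    for record in extract_tables(SAMPLE_HTML):
        print(record)
```

The point of the sketch is the scoping: cells are appended to the row currently being read, and rows to the table currently being read, so each record ends up with its own heading and its own list of rows.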

Upvotes: 0

Views: 40

Answers (1)

dabingsou

Reputation: 2469

I haven't learned Scrapy, so I would use another library. How about trying the following solution? You need to install the library first: pip install -U simplified_scrapy

from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain

class MindopSpider(Spider):
    name = 'MD'
    allowed_domains = ['zoznamspravcov.sk/']
    start_urls = ['https://www.zoznamspravcov.sk/cake_administrator/publishedAdministrators/view/1']
    # refresh_urls = True  # For debugging. If refresh_urls = True, start_urls will be crawled again.

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        # Correct HTML tags (raw string, so \s reaches the regex engine unchanged)
        doc['html'] = doc.replaceReg(doc.html, r'</th>\s*<td', '</td><td')
        blocks = doc.selects('div.administrators view>div|table')
        datas = []
        for block in blocks:
            obj = {'rows': []}
            obj['head'] = block.h2.text  # Table name
            rows = block.tbody.trs
            for row in rows:
                # One list of cell texts per table row
                obj['rows'].append([c.text for c in row.tds])
            datas.append(obj)
        print(datas)
        # Return the data to the framework, and the framework will automatically save it.
        return {"Urls": None, "Data": datas}

SimplifiedMain.startThread(MindopSpider())  # Start
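The "Correct HTML tags" step is the least obvious part, so here is what that pattern does, shown with Python's standard re module. This is only a sketch: it assumes replaceReg behaves like re.sub, and the broken row and the repair_cells helper are made up for illustration (the pattern suggests some cells on the page are closed with </th> instead of </td>, but I have not checked every row).

```python
import re


def repair_cells(html):
    """Turn a stray </th> that is followed by the next <td> into a proper </td>."""
    # Raw string so that \s is passed to the regex engine unchanged.
    return re.sub(r'</th>\s*<td', '</td><td', html)


# Hypothetical row whose first cell opens with <td> but closes with </th>.
broken = '<tr><td>Name</th>\n  <td>Example Person</td></tr>'
fixed = repair_cells(broken)

if __name__ == '__main__':
    print(fixed)  # <tr><td>Name</td><td>Example Person</td></tr>
```

After the repair, every cell in the row is a well-formed td, which is what lets row.tds return all the cells.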

Upvotes: 1
