Reputation: 29
I am trying to scrape all the tables from this page: "https://www.zoznamspravcov.sk/cake_administrator/publishedAdministrators/view/1".
My lame code so far:
import scrapy

class MindopSpider(scrapy.Spider):
    name = 'MD'
    start_urls = ['https://www.zoznamspravcov.sk/cake_administrator/publishedAdministrators/view/1/']

    def parse(self, response):
        panel = response.css('div.administrators.view')
        tables = response.css('table')
        for table in tables:
            head = table.css('h2::text').extract()
            trs = table.css('tr')
            for tr in trs:
                rows = table.css('td::text').extract()
            yield {'Head': head, 'Rows': rows}
In the end I would like each table with its name and its data. Could anybody help me? Many thanks :)
Upvotes: 0
Views: 40
Reputation: 2469
I haven't learned Scrapy, but I would use another library. How about trying the following solution? You need to install the library first: pip install -U simplified_scrapy
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain

class MindopSpider(Spider):
    name = 'MD'
    allowed_domains = ['zoznamspravcov.sk/']
    start_urls = ['https://www.zoznamspravcov.sk/cake_administrator/publishedAdministrators/view/1']
    # refresh_urls = True  # For debugging. If refresh_urls = True, start_urls will be crawled again.

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        doc['html'] = doc.replaceReg(doc.html, r'</th>\s*<td', '</td><td')  # Correct the HTML tags
        blocks = doc.selects('div.administrators view>div|table')
        datas = []
        for block in blocks:
            obj = {'rows': []}
            obj['head'] = block.h2.text
            rows = block.tbody.trs
            for row in rows:
                obj['rows'].append([c.text for c in row.tds])
            datas.append(obj)
        print(datas)
        return {"Urls": None, "Data": datas}  # Return the data to the framework, which saves it automatically.

SimplifiedMain.startThread(MindopSpider())  # Start
Upvotes: 1