Reputation: 679
I want to scrape data from a website that has text fields, buttons, etc. My requirement is to fill in the text fields and submit the form to get the results, and then scrape the data points from the results page.
I want to know whether Scrapy has this feature, or whether anyone can recommend another Python library that can accomplish this task?
EDIT:
I want to scrape the data from the following website:
http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType
My requirement is to select values from the combo boxes, hit the search button, and scrape the data points from the results page.
P.S. I'm currently using the Selenium Firefox driver to scrape data from another website, but that solution isn't good because the Selenium Firefox driver depends on the Firefox executable, i.e. Firefox must be installed before running the scraper.
The Selenium Firefox driver also consumes around 100 MB of memory per instance, and my requirement is to run a lot of instances at a time to make the scraping process quick, so there is a memory limitation as well.
Firefox sometimes crashes while the scraper is running, and I don't know why. I also need windowless (headless) scraping, which is not possible with the Selenium Firefox driver.
My ultimate goal is to run the scrapers on Heroku, where I have a Linux environment, so the Selenium Firefox driver won't work there.
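For reference, the kind of Selenium Firefox driver code I'm running today looks roughly like this (a minimal sketch; the element names are placeholders, not the real site's):

import time
from selenium import webdriver

# current approach: one full Firefox instance per scraper (heavy on memory, needs a display)
driver = webdriver.Firefox()
driver.get("http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType")

# fill a field and submit the form (placeholder element names)
field = driver.find_element_by_name("some_field")
field.send_keys("some value")
driver.find_element_by_name("some_button").click()
time.sleep(5)

# scrape the results page, then shut the browser down
print(driver.page_source[:500])
driver.quit()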
Thanks
Upvotes: 8
Views: 11259
Reputation: 1
I have used Scrapy, Selenium and BeautifulSoup. Scrapy will help you accomplish your task. For me, BeautifulSoup was more useful because I could use functions like prettify() and find_all(), which helped me significantly in understanding the HTML content.
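For illustration, a minimal sketch of that kind of inspection with BeautifulSoup (assuming the bs4 and requests packages; the URL is the one from the question):

import requests
from bs4 import BeautifulSoup

# fetch the search page and parse it
response = requests.get("http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType")
soup = BeautifulSoup(response.text, "html.parser")

# prettify() re-indents the HTML, which makes the page structure easier to read
print(soup.prettify()[:1000])

# find_all() collects matching tags, e.g. every <option> in the document-type combo box
for option in soup.find_all("option"):
    print(option.get("value"), option.get_text(strip=True))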
I would not recommend Selenium, as it slows down your process: it first loads the browser and its content and only then proceeds with the scraping, which takes much longer than the other packages.
Upvotes: 0
Reputation: 474241
Basically, you have plenty of tools to choose from; they have different purposes, but they can be mixed together depending on the task.
Scrapy is a powerful and very smart tool for crawling web sites and extracting data. But when it comes to manipulating the page (clicking buttons, filling forms), things become more complicated.
If you make your question more specific, it'll be easier to tell which tools you should use or choose from.
Take a look at an example of an interesting Scrapy & Selenium mix. Here, Selenium's task is to click the button and provide the data for the Scrapy items:
import time

from scrapy.item import Item, Field
from selenium import webdriver
from scrapy.spider import BaseSpider


class ElyseAvenueItem(Item):
    name = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # let Selenium load the page and click the "go" button
        self.driver.get(response.url)
        el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
        if el:
            el.click()

        # give the results time to load
        time.sleep(10)

        # hand the data Selenium sees over to Scrapy items
        plans = self.driver.find_elements_by_class_name("plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element_by_class_name('primary').text
            yield item

        self.driver.close()
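As a side note, the fixed time.sleep(10) could be replaced with an explicit wait; here is a minimal sketch using Selenium's WebDriverWait (wait_for_plans is a hypothetical helper, and "plan-info" is the class name from the example above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_plans(driver, timeout=10):
    # block until at least one .plan-info element is present, instead of sleeping a fixed 10 s
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "plan-info")))

Calling wait_for_plans(self.driver) in parse() would take the place of the time.sleep(10) line.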
UPDATE:
Here's an example of how to use Scrapy in your case:
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class AcrisItem(Item):
    borough = Field()
    block = Field()
    doc_type_name = Field()


class AcrisSpider(BaseSpider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # read the document-type options and the anti-forgery token from the search form
        document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')
        form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]

        # submit the search form once per document type
        for document_class in document_classes:
            if document_class:
                doc_type = document_class.select('.//@value').extract()[0]
                doc_type_name = document_class.select('.//text()').extract()[0]
                formdata = {'__RequestVerificationToken': form_token,
                            'hid_selectdate': '7',
                            'hid_doctype': doc_type,
                            'hid_doctype_name': doc_type_name,
                            'hid_max_rows': '10',
                            'hid_ISIntranet': 'N',
                            'hid_SearchType': 'DOCTYPE',
                            'hid_page': '1',
                            'hid_borough': '0',
                            'hid_borough_name': 'ALL BOROUGHS',
                            'hid_ReqID': '',
                            'hid_sort': '',
                            'hid_datefromm': '',
                            'hid_datefromd': '',
                            'hid_datefromy': '',
                            'hid_datetom': '',
                            'hid_datetod': '',
                            'hid_datetoy': ''}

                yield FormRequest(url="http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
                                  method="POST",
                                  formdata=formdata,
                                  callback=self.parse_page,
                                  meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        # walk the rows of the results table and build items
        rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.select('.//td[2]/div/font/text()').extract()
            block = row.select('.//td[3]/div/font/text()').extract()
            if borough and block:
                item['borough'] = borough[0]
                item['block'] = block[0]
                item['doc_type_name'] = response.meta['doc_type_name']
                yield item
Save it in spider.py and run it via scrapy runspider spider.py -o output.json; in output.json you will see:
{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...
Hope that helps.
Upvotes: 18
Reputation: 76835
If you simply want to submit the form and extract data from the resulting page, I'd go for:
Scrapy's added value really lies in its ability to follow links and crawl a website; I don't think it is the right tool for the job if you know precisely what you are searching for.
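For instance, a plain HTTP client plus an HTML parser can handle the whole submit-and-parse round trip. Here is a minimal sketch with requests and lxml (my choice of libraries for illustration; it reuses the form fields from the Scrapy answer above, and the document-type values are placeholders that would normally come from the search page's combo box):

import requests
from lxml import html

search_url = "http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType"
result_url = "http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult"

session = requests.Session()

# read the search form to pick up the anti-forgery token
page = html.fromstring(session.get(search_url).text)
token = page.xpath('//input[@name="__RequestVerificationToken"]/@value')[0]

# submit the form the same way the browser would
formdata = {
    "__RequestVerificationToken": token,
    "hid_selectdate": "7",
    "hid_doctype": "AGMT",          # placeholder document-type code
    "hid_doctype_name": "AGREEMENT",  # placeholder document-type name
    "hid_max_rows": "10",
    "hid_ISIntranet": "N",
    "hid_SearchType": "DOCTYPE",
    "hid_page": "1",
    "hid_borough": "0",
    "hid_borough_name": "ALL BOROUGHS",
}
result = html.fromstring(session.post(result_url, data=formdata).text)

# pull the text out of each row of the results table
for row in result.xpath('//form[@name="DATA"]//table//tr'):
    cells = row.xpath('.//td//text()')
    if cells:
        print(cells)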
Upvotes: 3
Reputation: 6891
I would personally use mechanize, as I do not have any experience with Scrapy. However, a library named Scrapy, which is purpose-built for screen scraping, should be up to the task. I would just have a go with both of them and see which does the job best/most easily.
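A minimal mechanize sketch of that kind of form-driven scraping (the URL, form index and control names are placeholders, since I have not looked at the actual form):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip robots.txt handling for this sketch
br.open("http://example.com/search")  # placeholder URL

# pick the first form on the page and fill in a control
# (control names and values are placeholders, not the real form's)
br.select_form(nr=0)
br["query"] = "some text"

# submit the form and read the results page
response = br.submit()
print(response.read()[:500])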
Upvotes: 2