Reputation: 2492
I'm learning Scrapy and I've run into a snag trying to submit a form that is controlled by JavaScript.
I've experimented with a number of things found here on Stack Overflow, including Selenium, but have had no luck (for a number of reasons).
The page I need to scrape is... http://agmarknet.nic.in/
...and do a commodities search. When I inspect the elements, it appears to have a form "m" with a field "cmm" that needs a commodity value.
<form name="m" method="post">
(...)
<input type="text" name="cmm" onchange="return validateName(document.m.cmm.value);" size="13">
(...)
<input type="button" value="Go" name="Go3" style="color: #000080; font-size: 8pt; font-family: Arial; font-weight: bold" onclick="search1();"></td>
Any advice gratefully accepted!
UPDATE: I've tried this with Selenium, but it doesn't find or populate the field. I also wouldn't mind being able to do this without popping up a Firefox window...
CrawlSpider.__init__(self)
self.verificationErrors = []
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://agmarknet.nic.in/")
time.sleep(4)
elem = driver.find_element_by_name("cmm")
elem.send_keys("banana")
time.sleep(5)
elem.send_keys(Keys.RETURN)
driver.close()
UPDATE:
I've also tried various iterations of the following, with no luck. When I submit the search from the web page, Fiddler2 shows it POSTing the string "cmm=banana&mkt=&search=", but when I use the code below, Fiddler shows nothing being posted...
class Agmarknet(Spider):
    name = "agmarknet"
    start_urls = ["http://agmarknet.nic.in/SearchCmmMkt.asp"]
    def parse(self, response):
        return [FormRequest.from_response(
            response,
            #formname = "cmm1",
            formdata={
                'method':'post',
                'cmm': 'banana',
                'mkt': '',
                'search':''},
            callback=self.after_search)]
    # Scrapy passes the response to the callback, so it must be a parameter
    def after_search(self, response):
        print response.body
OUTPUT FROM ABOVE:
{'download_timeout': 180, 'download_latency': 13.44700002670288, 'proxy': 'http://127.0.0.1:8888', 'download_slot': 'agmarknet.nic.in'}
Spider error processing <GET http://agmarknet.nic.in/SearchCmmMkt.asp>
Traceback (most recent call last):
File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\defer.py", line 382, in callback
self._startRunCallbacks(result)
File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "Z:\WindowsDocuments\eclipseworkspaces\BioCom\manoliagro-agmarknetscraper\src\bin\agmarknetscraper\spiders\agmarknet.py", line 34, in parse
callback=self.after_search)]
File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\scrapy-0.22.0-py2.7.egg\scrapy\http\request\form.py", line 36, in from_response
form = _get_form(response, formname, formnumber, formxpath)
File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\scrapy-0.22.0-py2.7.egg\scrapy\http\request\form.py", line 59, in _get_form
raise ValueError("No <form> element found in %s" % response)
exceptions.ValueError: No <form> element found in <200 http://agmarknet.nic.in/SearchCmmMkt.asp>
SpiderRun done
Upvotes: 1
Views: 2587
Reputation: 11396
With or without JavaScript, the submission always ends up as an HTTP request. Use Firebug to track down that request: its type (GET/POST) and the fields and values it sends. Then add those to your Scrapy Request.
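As a rough sketch of that idea, using only the Python standard library: the field names below (cmm, mkt, search) are the ones Fiddler reported in the question, while the target URL is an assumption (it is simply the question's start_urls entry) and should be replaced with whatever the browser's network log actually shows. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the POST target. Check the real one in Firebug/Fiddler.
SEARCH_URL = "http://agmarknet.nic.in/SearchCmmMkt.asp"


def build_search_request(commodity, market=""):
    """Build (but do not send) the POST request the search form would issue."""
    # Field names taken from the captured string "cmm=banana&mkt=&search="
    fields = {"cmm": commodity, "mkt": market, "search": ""}
    body = urlencode(fields)
    request = Request(
        SEARCH_URL,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
    return body, request


body, request = build_search_request("banana")
print(body)
print(request.get_method(), request.full_url)
```

In Scrapy, the same dictionary can be passed as formdata to FormRequest(url, formdata=..., callback=...), which does not depend on from_response finding a <form> element in the downloaded page.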
Upvotes: 1
Reputation: 1124
The page consists of two frames; a quick look at the source reveals their names, 'contents' and 'main'. Your script above nearly does the job: it is only missing a single line that switches to the right frame, 'main', with driver.switch_to_frame('main'). Also, the form does not react to the ENTER key, so we have to find the button and click it :-).
This code works:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://agmarknet.nic.in/")
time.sleep(4)
driver.switch_to_frame('main')
textinput = driver.find_element_by_name('cmm')
textinput.send_keys("banana")
time.sleep(1)
button = driver.find_element_by_name("Go3")
button.click()
driver.close()
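For finding the frame names without reading the source by hand, here is a minimal sketch using only the standard library. The SAMPLE_PAGE markup is a hypothetical stand-in for the site's frameset page (only the two frame names come from this answer; the src values are made up), and FrameLister is an illustrative helper, not part of Selenium.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Collect (name, src) pairs for every <frame> or <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attributes = dict(attrs)
            self.frames.append((attributes.get("name"), attributes.get("src")))


# Hypothetical stand-in for the frameset page; src values are invented.
SAMPLE_PAGE = """
<html><frameset cols="20%,80%">
  <frame name="contents" src="menu.html">
  <frame name="main" src="search.html">
</frameset></html>
"""

lister = FrameLister()
lister.feed(SAMPLE_PAGE)
frame_names = [name for name, _ in lister.frames]
print(frame_names)
```

Whichever name holds the form is the one to pass to the frame switch before looking up the 'cmm' field.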
Upvotes: 4
Reputation: 1124
Selenium IDE will make your life easier, especially with JavaScript. It is a Firefox plugin that records what you click and type in Firefox, then shows you the code for the elements you need to put in your Python script. It was very useful for me, and not only with forms. :-)
Try it out!
Upvotes: 0