Reputation: 11
I'm trying to scrape NGO data (name, mobile no., city, etc.) from https://ngodarpan.gov.in/index.php/search/. The site lists NGO names in a table, and clicking a name opens a pop-up with the details. In my code below, I'm extracting the onclick attribute for each NGO, then making a GET request followed by a POST request to fetch the data. I've also tried accessing it with Selenium, but the JSON data never comes back.
list_of_cells = []
for cell in row.find_all('td'):
    text = cell.text.replace(" ", "")
    list_of_cells.append(text)
list_of_rows.append(list_of_cells)
writer = csv.writer(f)
writer.writerow(list_of_cells)
By implementing the portion above we can get the full table details from all of the pages; the site has 7,721 pages, and we can simply change the number_of_pages var. But our goal is the NGO phone no./email id, which is the main purpose and only appears after clicking the NGO name link. That link is not a plain href: clicking it fires an API GET request (for a CSRF token) followed by a POST request to fetch the data, as can be seen in the network section of the browser's inspect tools.
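Roughly, the request pair visible in the network tab looks like the sketch below (my assumption being that the token has to be fetched and re-posted within the same session so the cookies match; the id value is just a placeholder):

import requests

sess = requests.Session()  # assumption: token and session cookie must come from the same session
tok = sess.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf").json()["csrf_token"]
resp = sess.post(
    "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info",
    data={"id": "1", "csrf_test_name": tok},  # "1" is a placeholder NGO id
    headers={"X-Requested-With": "XMLHttpRequest"},
)
print(resp.json())

My current attempt: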
driver.get("https://ngodarpan.gov.in/index.php/search/") # load the web page
sleep(2)
....
....
driver.find_element(By.NAME,"commit").submit()
for page in range(number_of_pages - 1):
list_of_rows = []
src = driver.page_source # gets the html source of the page
parser = BeautifulSoup(src,'html.parser')
sleep(1)
table = parser.find("table",{ "class" : "table table-bordered table-striped" })
sleep(1)
for row in table.find_all('tr')[:]:
list_of_cells = []
for cell in row.find_all('td'):
x = requests.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
dat=x.json()
z=dat["csrf_token"]
print(z) # prints csrf token
r= requests.post("https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", data = {'id':'','csrf_test_name':'z'})
json_data=r.text # i guess here is something not working it is printing html text but we need text data of post request like mob,email,and here it will print all the data .
with open('data1.json', 'a') as outfile:
json.dump(json_data, outfile)
driver.find_element_by_xpath("//a[contains(text(),'»')]").click()
There is no error message as such; the code runs, but instead of the JSON it prints this HTML content:
<html>
...
...
<body>
<div id="container">
<h1>An Error Was Encountered</h1>
<p>The action you have requested is not allowed.</p> </div>
</body>
</html>
Upvotes: 1
Views: 4170
Reputation: 11
I am trying to iterate over all the pages and extract the data in one run. After extracting data from one page, it does not move on to the other pages:
....
....
['9829059202', '[email protected]', 'CECOEDECON', '206, Jaipur, RAJASTHAN']
['9443382475', '[email protected]', 'ODAM', '43/1995, TIRUCHULI, TAMIL NADU']
['9816510096', '[email protected]', 'OPEN EDUCATIONAL DEVELOPMENT RESEARCH AND WELFARE', '126/2004, SUNDERNAGAR, HIMACHAL PRADESH']
['9425013029', '[email protected]', 'Centre for Advanced Research and Development', '25634, Bhopal, MADHYA PRADESH']
['9204645161', '[email protected]', 'Srijan Mahila Vikas Manch', '833, Chakradharpur, JHARKHAND']
['9419107550', '[email protected]', 'J and K Sai Star Society', '4680-S, Jammu, JAMMU & KASHMIR']
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
...
...
Upvotes: 0
Reputation: 46759
This could be done much faster by avoiding the use of Selenium. Their site appears to request a fresh token prior to each request; you might find it is possible to skip this.
The following shows how to get the JSON containing the mobile number and email address:
from bs4 import BeautifulSoup
import requests
import time

def get_token(sess):
    req_csrf = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf')
    return req_csrf.json()['csrf_token']

search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/{}"
details_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info"

sess = requests.Session()

for page in range(0, 10000, 10):  # Advance 10 at a time
    print(f"Getting results from {page}")

    for retry in range(1, 10):
        data = {
            'state_search': 7,
            'district_search': '',
            'sector_search': 'null',
            'ngo_type_search': 'null',
            'ngo_name_search': '',
            'unique_id_search': '',
            'view_type': 'detail_view',
            'csrf_test_name': get_token(sess),
        }

        req_search = sess.post(search_url.format(page), data=data, headers={'X-Requested-With': 'XMLHttpRequest'})
        soup = BeautifulSoup(req_search.content, "html.parser")
        table = soup.find('table', id='example')

        if table:
            for tr in table.find_all('tr'):
                row = [td.text for td in tr.find_all('td')]
                link = tr.find('a', onclick=True)

                if link:
                    # Pull the numeric id out of e.g. onclick="show_ngif(123456)"
                    link_number = link['onclick'].strip("show_ngif(')")
                    req_details = sess.post(details_url, headers={'X-Requested-With': 'XMLHttpRequest'}, data={'id': link_number, 'csrf_test_name': get_token(sess)})
                    json = req_details.json()
                    details = json['infor']['0']
                    print([details['Mobile'], details['Email'], row[1], row[2]])
            break
        else:
            print(f'No data returned - retry {retry}')
            time.sleep(3)
This would give you the following kind of output for the first page:
['9871249262', '[email protected]', 'Pragya Network Educational Society', 'S-52559, Narela, DELHI']
['9810042046', '[email protected]', 'HelpAge India', '9270, New Delhi, DELHI']
['9811897589', '[email protected]', 'All India Parivartan Sewa Samiti', 's-43282, New Delhi, DELHI']
Upvotes: 1
Reputation: 479
Switch to an iframe through Selenium and python
You can use an XPath to locate the iframe:
iframe = driver.find_element_by_xpath("//iframe[@name='Dialogue Window']")
Then switch_to the iframe:
driver.switch_to.frame(iframe)
Here's how to switch back to the default content (out of the iframe):
driver.switch_to.default_content()
In your instance, I believe the 'Dialogue Window' name would be CalendarControlIFrame
Once you switch to that frame, you will be able to use Beautiful Soup to get the frame's html.
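For example, a minimal sketch of that last step (assuming the driver and iframe objects from above):

from bs4 import BeautifulSoup

driver.switch_to.frame(iframe)  # enter the frame
soup = BeautifulSoup(driver.page_source, 'html.parser')  # page_source now reflects the frame's html
driver.switch_to.default_content()  # leave the frame again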
Upvotes: 0