Reputation: 77
I have a page that has a table (table id= "ctl00_ContentPlaceHolder_ctl00_ctl00_GV" class="GridListings" )i need to scrape. I usually use BeautifulSoup & urllib for it,but in this case the problem is that the table takes some time to load ,so it isnt captured when i try to fetch it using BS. I cannot use PyQt4,drysracpe or windmill because of some installation issues,so the only possible way is to use Selenium/PhantomJS I tried the following,still no success:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get(url)
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located(By.CSS_SELECTOR, 'table#ctl00_ContentPlaceHolder_ctl00_ctl00_GV'))
The above code doesnt give me the desired contents of the table. How do i go about achieveing this???
Upvotes: 2
Views: 757
Reputation: 6736
If you want to scrap something, it will be nice first to install a web debugger ( Firebug for Mozilla Firefox for example) to watch how the website you want to scrap is working.
Next, you need to copy the process of how the website is connecting to backoffice
As you said, the content that you want to scrap is being loaded asynchronously (only when the document is ready)
Assuming the debugger is running and also you have refreshed the page, you will see on the network tab the following request:
POST https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx
The final process flow to reach your goal will be:
Please see bellow a working code:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import requests
base_url="https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
#create requests session
s = requests.session()
#get index page
r=s.get(base_url)
#soup page
bs=BeautifulSoup(r.text)
#extract FORM html
form_soup= bs.find('form',{'name':'aspnetForm'})
#extracting all inputs
input_div = form_soup.findAll("input")
#build the data parameters for POST request
#we add some required <fixed> data parameters for post
data={
'__EVENTARGUMENT':'LISTINGS;0',
'__EVENTTARGET':'ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV',
'__EVENTVALIDATION':'/wEWGwKis6fzCQLDnJnSDwLq4+CbDwK9jryHBQLrmcucCgL56enHAwLRrPHhCgKDk6P+CwL1/aWtDQLm0q+gCALRvI2QDAKch7HjBAKWqJHWBAKil5XsDQK58IbPAwLO3dKwCwL6uJOtBgLYnd3qBgKyp7zmBAKQyTBQK9qYAXAoieq54JAuG/rDkC1djKyQMC1qnUtgoC0OjaygUCv4b7sAhfkEODRvsa3noPfz2kMsxhAwlX3Q=='
}
#we add some <dynamic> data parameters
for input_d in input_div:
try:
data[ input_d['name'] ] =input_d['value']
except:
pass #skip unused input field
#post request
r2=s.post(base_url,data=data)
#write the result
with open("post_result.html","w") as f:
f.write(r2.text.encode('utf8'))
Now, please get a look at "post_result.html" content and you will find the data !
Regards
Upvotes: 0
Reputation: 180540
You can get the data using requests and bs4,, with almost if not all asp sites there are a few post params that always need to be provided like __EVENTTARGET, __EVENTVALIDATION etc.. :
from bs4 import BeautifulSoup
import requests
data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
"__EVENTARGUMENT": "LISTINGS;0",
"ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
"__ASYNCPOST": "true"}
And for the actual post, we need to add a few more values to out post data:
post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
s.headers.update({"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
soup = BeautifulSoup(s.get(post).content)
data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
r = s.post(post, data=data)
soup2 = BeautifulSoup(r.content)
table = soup2.select_one("div.GridListings")
print(table)
You will see the table printed when you run the code.
Upvotes: 2