Reputation: 490
I am trying to scrape a webpage to get the table values out of the text returned in a requests response.
</thead>
<tbody class="stats"></tbody>
<tbody class="annotation"></tbody>
</table>
</div>
In the browser there is data present inside those tbody
elements, but I am unable to access that data using requests.
Here is my code
import requests

server = "http://www.ebi.ac.uk/QuickGO/GProtein"
header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}
payloads = {'ac': 'Q9BRY0'}
response = requests.get(server, params=payloads, headers=header)
print(response.text)
#soup = BeautifulSoup(response.text, 'lxml')
#print(soup)
Upvotes: 2
Views: 1766
Reputation: 21643
Frankly, I'm losing interest in routine scraping with tools like Selenium, and beyond that I wasn't sure this page would work with them. The approach below does.
You would only do this, in this form at least, if you had more than a few files to download.
>>> import bs4
>>> form = '''<form method="POST" action="GAnnotation"><input name="a" value="" type="hidden"><input name="termUse" value="ancestor" type="hidden"><input name="relType" value="IPO=" type="hidden"><input name="customRelType" value="IPOR+-?=" type="hidden"><input name="protein" value="Q9BRY0" type="hidden"><input name="tax" value="" type="hidden"><input name="qualifier" value="" type="hidden"><input name="goid" value="" type="hidden"><input name="ref" value="" type="hidden"><input name="evidence" value="" type="hidden"><input name="with" value="" type="hidden"><input name="source" value="" type="hidden"><input name="q" value="" type="hidden"><input name="col" value="proteinDB,proteinID,proteinSymbol,qualifier,goID,goName,aspect,evidence,ref,with,proteinTaxon,date,from,splice" type="hidden"><input name="select" value="normal" type="hidden"><input name="aspectSorter" value="" type="hidden"><input name="start" value="0" type="hidden"><input name="count" value="25" type="hidden"><input name="format" value="gaf" type="hidden"><input name="gz" value="false" type="hidden"><input name="limit" value="22" type="hidden"></form>'''
>>> soup = bs4.BeautifulSoup(form, 'lxml')
>>> action = soup.find('form').attrs['action']
>>> action
'GAnnotation'
>>> inputs = soup.findAll('input')
>>> params = {}
>>> for input in inputs:
... params[input.attrs['name']] = input.attrs['value']
...
>>> import requests
>>> r = requests.post('http://www.ebi.ac.uk/QuickGO/GAnnotation', data=params)
>>> r
<Response [200]>
>>> open('temp.htm', 'w').write(r.text)
4082
The downloaded file is what you would receive if you simply clicked on the button.
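Once you have that file, getting the rows back into Python is easy: GAF output is plain tab-separated text whose comment lines start with `!`, so the standard csv module is enough. The sample below is a made-up two-line miniature (the column values are illustrative, not real QuickGO output):

```python
import csv
import io

# A made-up miniature of GAF-style output: tab-separated columns,
# with comment lines starting with '!'.
sample = (
    "!gaf-version: 2.0\n"
    "UniProtKB\tQ9BRY0\tZNF622\tGO:0005515\tIPI\n"
)

# Skip comment lines, split the rest on tabs.
rows = [row for row in csv.reader(io.StringIO(sample), delimiter='\t')
        if row and not row[0].startswith('!')]

print(rows)  # one data row; the accession is in the second column
```

For the real file you would pass `open('temp.htm')` to `csv.reader` instead of the `StringIO` sample.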
Details for the Chrome browser: in the developer tools, you want the outerHTML
property of the form element; it contains the information used in the code above,
namely the action and the name-value pairs (and, implicitly, that the method is POST).
Now use the requests module to submit a request to the website.
Here's a list of the items in params
in case you want to make other requests.
>>> for item in params.keys():
... item, params[item]
...
('qualifier', '')
('source', '')
('count', '25')
('protein', 'Q9BRY0')
('format', 'gaf')
('termUse', 'ancestor')
('gz', 'false')
('with', '')
('goid', '')
('start', '0')
('customRelType', 'IPOR+-?=')
('evidence', '')
('aspectSorter', '')
('tax', '')
('relType', 'IPO=')
('limit', '22')
('col', 'proteinDB,proteinID,proteinSymbol,qualifier,goID,goName,aspect,evidence,ref,with,proteinTaxon,date,from,splice')
('q', '')
('ref', '')
('select', 'normal')
('a', '')
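If you do want to make other requests, it can be convenient to wrap the form fields in a small helper and override only the values you care about. A sketch, assuming the requests module; the `P12345` accession and the `'100'` limit are illustrative placeholders, not values from the page:

```python
import requests

BASE = 'http://www.ebi.ac.uk/QuickGO/GAnnotation'

def build_params(protein, limit='25', fmt='gaf'):
    """Recreate the hidden form fields, overriding the ones we care about."""
    return {
        'a': '', 'termUse': 'ancestor', 'relType': 'IPO=',
        'customRelType': 'IPOR+-?=', 'protein': protein, 'tax': '',
        'qualifier': '', 'goid': '', 'ref': '', 'evidence': '',
        'with': '', 'source': '', 'q': '',
        'col': ('proteinDB,proteinID,proteinSymbol,qualifier,goID,goName,'
                'aspect,evidence,ref,with,proteinTaxon,date,from,splice'),
        'select': 'normal', 'aspectSorter': '', 'start': '0',
        'count': limit, 'format': fmt, 'gz': 'false', 'limit': limit,
    }

# Same POST as before, but for a different (illustrative) protein:
# r = requests.post(BASE, data=build_params('P12345', limit='100'))
# open('P12345.gaf', 'w').write(r.text)
```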
Upvotes: 1
Reputation: 2132
I gather from your comment above that you're dealing with JavaScript-rendered content. In order to scrape and parse it you could use Selenium. Here is a snippet that could help in your case:
from selenium import webdriver
from bs4 import BeautifulSoup

url = ''  # the page you want to render
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
print(soup.prettify())
You will have to install ChromeDriver and the Chrome browser, though. If you want, you could use a headless browser like PhantomJS so you wouldn't have to deal with a full Chrome window every time you execute the script.
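As an alternative to PhantomJS, Chrome itself can run headless, which avoids the visible window in the same way. A minimal sketch, assuming a Selenium version that accepts the `options=` keyword and a ChromeDriver on your PATH; the `tbody.stats` selector is taken from the markup in the question:

```python
def get_rendered_soup(url):
    """Fetch a page with headless Chrome and return the rendered soup.

    Assumes Selenium and ChromeDriver are installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup

    opts = Options()
    opts.add_argument('--headless')  # render without opening a window
    browser = webdriver.Chrome(options=opts)
    try:
        browser.get(url)
        return BeautifulSoup(browser.page_source, 'lxml')
    finally:
        browser.quit()

# Usage (not run here; requires a local ChromeDriver):
# soup = get_rendered_soup('http://www.ebi.ac.uk/QuickGO/GProtein?ac=Q9BRY0')
# rows = soup.select('tbody.stats tr')
```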
Upvotes: 0