Reputation: 467
I am trying to programmatically send a list of genes to the well-known website DAVID (http://david.abcc.ncifcrf.gov/summary.jsp) for functional annotation. There are two other access routes - the API service (http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html) and the web service (http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html) - but the former has stricter query limits and the latter doesn't accept my ID type (http://david.abcc.ncifcrf.gov/forum/viewtopic.php?f=14&t=885), so the only option seems to be a program that posts the form, parses the resulting page, and extracts the download link. Using the Firefox plugin 'httpFox' to monitor the traffic, I gave it a try with the following script:
import time
import requests as rq
_n = 1
url0 = 'http://david.abcc.ncifcrf.gov'
url = 'http://david.abcc.ncifcrf.gov/summary.jsp'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:30.0) Gecko/20100101 Firefox/30.0'
def get_cookie(session_id): # prepare 'Cookie' in the headers for the post
    domain_hash = '260267544' # according to what's been sent by firefox
    random_uid = '1113731634' # according to what's been sent by firefox
    global _t0
    init_time = _t0
    global _t
    prev_time = _t
    _t = int(time.time())
    curr_time = _t
    global _n
    _n += 1
    session_count = _n
    campaign_count = 1
    utma = '.'.join(str(x) for x in (domain_hash, random_uid, init_time, prev_time, curr_time, session_count))
    utmz = '.'.join(str(x) for x in (domain_hash, init_time, session_count, campaign_count, 'utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'))
    cookie = '; '.join(str(x) for x in ('__utma=' + utma, '__utmz=' + utmz, 'JSESSIONID=' + session_id))
    return cookie
# first get the session ID
_t = int(time.time())
_t0 = _t
headers = {'User-Agent' : user_agent}
r = rq.get(url, headers = headers)
session_id = r.cookies['JSESSIONID']
cookie = get_cookie(session_id)
# get the gene list
gene = []
with open('list.txt', 'r') as fh:
    for line in fh:
        gene.append(line.rstrip('\n'))
# then post the form
headers = { # all below is according to what's been sent by firefox
    'Host' : 'david.abcc.ncifcrf.gov',
    'User-Agent' : user_agent,
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-US,en;q=0.5',
    'Accept-Encoding' : 'gzip, deflate',
    'Referer' : url,
    'Cookie': cookie,
    'Connection' : 'keep-alive',
    # 'Content-Type' : 'multipart/form-data; boundary=---------------------------17914945481928137296675300642',
    # 'Content-Length' : '3581'
}
data = { # all below is according to what's been sent by firefox
    'idType' : 'OFFICIAL_GENE_SYMBOL',
    'uploadType' : 'list',
    'multiList' : 'false',
    'Mode' : 'paste',
    'useIndex' : 'null',
    'usePopIndex' : 'null',
    'demoIndex' : 'null',
    'ids' : '\n'.join(gene),
    'removeIndex' : 'null',
    'renameIndex' : 'null',
    'renamePopIndex' : 'null',
    'newName' : 'null',
    'combineIndex' : 'null',
    'selectedSpecies' : 'null',
    'SESSIONID' : session_id[-12:], # firefox sends only the last 12 characters of 'JSESSIONID'
    'uploadHTML' : 'null',
    'managerHTML' : 'null',
    'sublist' : '',
    'rowids' : '',
    'convertedListName' : 'null',
    'convertedPopName' : 'null',
    'pasteBox' : '\n'.join(gene),
    'fileBrowser' : '',
    'Identifier' : 'OFFICIAL_GENE_SYMBOL',
    'rbUploadType' : 'list'}
r = rq.post(url = url, data = data, headers = headers)
if r.status_code == 200:
    with open("python.html", 'w') as fh:
        fh.write(r.text)
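A leaner variant of the same POST: delegating the cookies to requests.Session means JSESSIONID is carried back automatically, so the __utma/__utmz strings would not need to be forged by hand (that they are unnecessary is my assumption, not something DAVID documents). A sketch, with the form trimmed to the essential fields:

```python
def build_form(genes, jsessionid):
    """Mirror the essential fields of the 'data' dict above."""
    ids = '\n'.join(genes)
    return {
        'idType': 'OFFICIAL_GENE_SYMBOL',
        'uploadType': 'list',
        'Mode': 'paste',
        'ids': ids,
        'pasteBox': ids,
        'SESSIONID': jsessionid[-12:],  # last 12 chars, as seen in httpFox
        'rbUploadType': 'list',
    }

def main():  # requires network access; call main() to run the upload
    import requests  # imported here so build_form stays dependency-free
    url = 'http://david.abcc.ncifcrf.gov/summary.jsp'
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0'
        s.get(url)  # the server sets JSESSIONID in s.cookies
        form = build_form(['Apba3', 'Dexi'], s.cookies['JSESSIONID'])
        r = s.post(url, data=form)  # Cookie header is added automatically
        print(r.status_code, len(r.text))
```

Whether the trimmed form is accepted by the server is exactly what I cannot verify yet, so the full field list above remains my reference.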
However, the page retrieved by my code is 272KB, clearly different from the content recorded by httpFox, which is 428KB. I compared the headers and the form sent by my script against those sent by Firefox; the only difference seems to be in
Above is the basic situation, and I would appreciate it if someone could help pinpoint where the problem is. I have also seen other advice, e.g. trying the browser emulator 'mechanize', but I am more curious about the cause: is something wrong with my program (and if so, how do I correct it), or are these modules simply not sufficient for the task? Thanks a lot.
My list to post is:
Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras
My browser post procedure is:
The browser then returns a new page with a pop-up window prompting the user to select the species and background. That page is the content tracked by httpFox in this post, and it is what I am trying to capture with my script.
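Once the correct page does come back, the final step I described - extracting the download link - could be a simple regular expression over the HTML. The assumption that the link is an href ending in .txt is my guess from browsing, not anything DAVID documents:

```python
import re

def find_download_links(html):
    """Return all href values ending in .txt (assumed download-link pattern)."""
    return re.findall(r'href="([^"]+\.txt)"', html)

# Hypothetical markup shaped like a DAVID results page:
sample = '<a href="data/download/chart_ABC123.txt">Download File</a>'
print(find_download_links(sample))  # -> ['data/download/chart_ABC123.txt']
```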
Upvotes: 0
Views: 97
Reputation: 2748
Use Selenium:
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox()
driver.get('http://david.abcc.ncifcrf.gov/summary.jsp')
sleep(0.1)
query = """Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras"""
listBox = driver.find_element_by_id("LISTBox")  # the paste box on summary.jsp
listBox.send_keys(query)
IDT = driver.find_element_by_id("IDT")  # the identifier-type drop-down
IDT.send_keys("O")  # jumps to the option starting with 'O' (OFFICIAL_GENE_SYMBOL)
radioCheck = driver.find_element_by_name("rbUploadType")  # the 'Gene List' radio button
radioCheck.click()
submitButton = driver.find_element_by_name("B52")
submitButton.click()
sleep(0.1)
alert = driver.switch_to_alert()  # dismiss the confirmation pop-up
alert.accept()
sleep(0.1)
html = driver.page_source
The variable "html" contains the page source.
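From there, the download link can be pulled out of the source without any extra dependency using the standard library's HTML parser (Python 3 shown; the .txt suffix test is an assumption about DAVID's link format):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="data/download/chart.txt">Download</a>')
txt_links = [h for h in parser.links if h.endswith('.txt')]
print(txt_links)  # -> ['data/download/chart.txt']
```

In practice you would call parser.feed(html) on the Selenium page source instead of the literal string above.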
Upvotes: 1