page returned by python post different from that by browser

Question

I am trying to programmatically send a list of genes to the well-known website DAVID (http://david.abcc.ncifcrf.gov/summary.jsp) for functional annotation. Although there are other two ways - the API service (http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html) and the web service (http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html), the former has stricter query limitations and the latter doesn't accept my ID type (http://david.abcc.ncifcrf.gov/forum/viewtopic.php?f=14&t=885), so the only choice seems to be a program to post the form, parse the resulting page and extract the download link. Using the firefox plugin 'httpFox' to monitor the transmission, I gave a try with the following script:

import urllib
import urllib2
import requests as rq
import time

_n = 1
url0 = 'http://david.abcc.ncifcrf.gov'
url = 'http://david.abcc.ncifcrf.gov/summary.jsp'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:30.0) Gecko/20100101 Firefox/30.0'

def get_cookie(session_id): # prepare 'Cookie' in the headers for the post
    domain_hash = '260267544' # according to what's been sent by firefox 
    random_uid = '1113731634' # according to what's been sent by firefox
    global _t0
    init_time = _t0
    global _t 
    prev_time = _t
    _t = int(time.time())
    curr_time = _t
    global _n
    _n += 1
    session_count = _n
    campaign_count = 1
    utma = '.'.join(str(x) for x in (domain_hash, random_uid, init_time, prev_time, curr_time, session_count))
    utmz = '.'.join(str(x) for x in (domain_hash, init_time, session_count, campaign_count, 'utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'))
    cookie = '; '.join(str(x) for x in ('__utma=' + utma, '__utmz=' + utmz, 'JSESSIONID=' + session_id)) 
    return(cookie)

# first get the session ID
_t = int(time.time())
_t0 = _t
headers = {'User-Agent' : user_agent}
r = rq.get(url, headers = headers) 
session_id = r.cookies['JSESSIONID']
cookie = get_cookie(session_id)

# get the gene list
gene = []
fh = open('list.txt', 'r')
for line in fh:
    gene.append(line.rstrip('
'))

fh.close()

# then post the form
headers = {  # all below is according to what's been sent by firefox
           'Host' : 'david.abcc.ncifcrf.gov',
           'User-Agent' : user_agent, 
           'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
           'Accept-Language' : 'en-US,en;q=0.5', 
           'Accept-Encoding' : 'gzip, deflate',
           'Referer' : url,
           'Cookie': cookie, 
           'Connection' : 'keep-alive', 
#           'Content-Type' : 'multipart/form-data; boundary=---------------------------17914945481928137296675300642',
#           'Content-Length' : '3581'
           }

data = {  # all below is according to what's been sent by firefox
        'idType' : 'OFFICIAL_GENE_SYMBOL',
        'uploadType' : 'list', 
        'multiList' : 'false', 
        'Mode' : 'paste', 
        'useIndex' : 'null',
        'usePopIndex' : 'null', 
        'demoIndex' : 'null', 
        'ids' : '
'.join(gene), 
        'removeIndex' : 'null', 
        'renameIndex' : 'null', 
        'renamePopIndex' : 'null', 
        'newName' : 'null', 
        'combineIndex' : 'null', 
        'selectedSpecies' : 'null', 
        'SESSIONID' : session_id[-12:], # according to the pattern that the last 12 characters of 'JSESSIONID' is sent by firefox
        'uploadHTML' : 'null', 
        'managerHTML' : 'null', 
        'sublist' : '',
        'rowids' : '',
        'convertedListName' : 'null', 
        'convertedPopName' : 'null', 
        'pasteBox' : '
'.join(gene), 
        'fileBrowser' : '', 
        'Identifier' : 'OFFICIAL_GENE_SYMBOL', 
        'rbUploadType' : 'list'}

r = rq.post(url = url, data = data, headers = headers)
if r.status_code == 200:
    fh = open("python.html", 'w')
    fh.write(r.text)
    fh.close()

However, the page got by my code is 272KB, definitely different from the content returned by httpFox, which is 428KB. I compared the header and the form sent by my script and by firefox, the difference seems only to be in

the cookie fields __utma and __utmz, but they are related to google analytics, it sounds they shouldn't really matter, and
the fields 'Content-Type' and 'Content-Length' in the 2nd header where I commented. Due to the suggestion in Is Python requests doing something wrong here, or is my POST request lacking something?, it appears unnecessary to specify them manually. However even after I commented them, it doesn't work.

Above is the basic situation, and I appreciate if someone can help figure out specifically where the problem is. Besides, I've seen some other advice, e.g. trying the browser emulator 'mechanize'. But I am more curious about the reason, i.e. is it something wrong with my program and if so how to correct it, or are these modules simply not sufficient for the task? Thanks a lot.

My list to post is:

Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras

My browser post procedure is:

in firefox open http://david.abcc.ncifcrf.gov/summary.jsp
in the left panel in default, input the above gene list in the box "Step 1: Enter Gene List A: Paste a list"
click the drop-down button and select "OFFICIAL_GENE_SYMBOL" in "Step 2: Select Identifier"
check the radio button "Gene List" in "Step 3: List Type"
click "Submit List" in "Step 4: Submit List"

Then the browser returns a new page with a pop-up window prompting users to select the species and background, which is the content tracked by httpFox in this post, also is what I am trying to capture by my script.

caiohamamura · Accepted Answer

Use Selenium:

from selenium import webdriver
from time import sleep

driver = webdriver.Firefox()
driver.get('http://david.abcc.ncifcrf.gov/summary.jsp')
sleep(0.1)
query = """Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras"""
listBox = driver.find_element_by_id("LISTBox")
listBox.send_keys(query)

IDT = driver.find_element_by_id("IDT")
IDT.send_keys("O")

radioCheck = driver.find_element_by_name("rbUploadType")
radioCheck.click()


submitButton = driver.find_element_by_name("B52")

submitButton.click()
sleep(0.1)
alert = driver.switch_to_alert()
alert.accept()
sleep(0.1)
html = driver.page_source

The variable "html" contains the page source.

page returned by python post different from that by browser

Answers (1)

Related Questions