Md Faisal

Reputation: 2991

A Python script that automatically submits text to a website and gets the resulting page's source code

I am doing biomedical named entity extraction using Python.

Now I have to cross-check the results by entering the text at http://text0.mib.man.ac.uk/software/geniatagger/ and parsing the source of the HTML page that I get after submitting the text.

I want the same thing to be done from my own GUI, i.e. it should take the input from the GUI that I have made, submit the text to this website, and get the source code back, so that I don't have to visit the site in the browser each time I cross-check.

Thanks in advance

Upvotes: 3

Views: 6475

Answers (1)

Jan Vorcak

Reputation: 19989

Actually, this is a great question!

The first thing you have to do is explore the source code of the website a little. If you look at the page source, you will see this block of code:

<form method="POST" action="a.cgi">
<p>
Please enter a text that you want to analyze.
</p>
<p>
<textarea name="paragraph" rows="15" cols="80" wrap="soft">
... some text here ...
### This is a sample. Replace this with your own text.

</textarea>
</p>
<p>
<input type="submit" value="Submit Text" />
<input type="reset" />
</p>
</form>

You can see that the request is sent to the relative address a.cgi. Since we are already at the address

http://text0.mib.man.ac.uk/software/geniatagger/

the data we want to send will go to a.cgi resolved against that address:

http://text0.mib.man.ac.uk/software/geniatagger/a.cgi
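
As a small illustration of how the relative form action combines with the page address, here is a sketch using urljoin from Python 3's standard library. The variable names page_url, form_action and submit_url are made up for this example.

```python
from urllib.parse import urljoin

# Address of the page that contains the form.
page_url = "http://text0.mib.man.ac.uk/software/geniatagger/"

# Value of the form's action attribute (a relative address).
form_action = "a.cgi"

# Resolve the relative action against the page address,
# the same way a browser does when the form is submitted.
submit_url = urljoin(page_url, form_action)
print(submit_url)
```

Using urljoin instead of plain string concatenation also gives the right result when the action is an absolute path or a full URL.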

But what are we going to send there? We need data. The data is sent as the POST parameter "paragraph": you can tell because the form's method attribute has the value POST and the textarea's name is "paragraph".
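
If you would rather read these details out of the form programmatically than by eye, something like the following sketch could work. It uses Python 3's html.parser on a shortened copy of the form shown above. FormInspector, FORM_HTML and the sample sentence are hypothetical names and data for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Shortened copy of the form from the page source.
FORM_HTML = """
<form method="POST" action="a.cgi">
<textarea name="paragraph" rows="15" cols="80" wrap="soft"></textarea>
<input type="submit" value="Submit Text" />
<input type="reset" />
</form>
"""


class FormInspector(HTMLParser):
    """Records the form's method and action, plus the names of its fields."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # HTML forms default to GET when no method is given.
            self.method = attrs.get("method", "GET").upper()
            self.action = attrs.get("action")
        elif tag in ("input", "textarea", "select") and "name" in attrs:
            # Only named fields are sent with the request.
            self.field_names.append(attrs["name"])


inspector = FormInspector()
inspector.feed(FORM_HTML)
inspector.close()
print(inspector.method, inspector.action, inspector.field_names)

# The request body is the field name and its value, URL-encoded.
body = urlencode({"paragraph": "IL-2 gene expression"})
print(body)
```

The two input elements have no name attribute, so the only field that is submitted is "paragraph".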

We can submit it with this Python code:

import urllib.parse
import urllib.request

text =  """
        Further, while specific constitutive binding to the peri-kappa B site is seen in monocytes, stimulation with phorbol esters induces additional, specific binding. Understanding the monocyte-specific function of the peri-kappa B factor may ultimately provide insight into the different role monocytes and T-cells play in HIV pathogenesis. 

### This is a sample. Replace this with your own text.
        """

# The form has a single named field: the textarea called "paragraph".
data = {
        "paragraph" : text 
       }

# URL-encode the form data; the request body must be bytes.
encoded_data = urllib.parse.urlencode(data).encode("utf-8")

# Passing a data argument makes urlopen send a POST request.
content = urllib.request.urlopen("http://text0.mib.man.ac.uk/software/geniatagger/a.cgi",
        encoded_data)

# Print the HTML source of the result page.
print(content.read().decode("utf-8", errors="replace"))

And what do we have so far? An "engine" for your GUI program. You can optionally parse the content variable with Python's HTMLParser.

You also mentioned that you want to display this in a GUI. You can do that using GTK or Qt and map this functionality to a single button. You will need to read a tutorial, but it is really easy for this purpose. If you have problems, just comment on this post and I can extend this answer with a GUI example.
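
To make the "engine" idea a bit more concrete, here is a rough Python 3 sketch of a function that a GUI button callback could call, together with an HTMLParser subclass that pulls the visible text out of the returned page. The names TAGGER_URL, TextExtractor, tag_text and extract_text are hypothetical. I am not assuming anything about the layout of the result page, so the extractor simply collects every non-empty text chunk. The opener parameter is only there so the network call can be swapped out, for example while testing.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser

TAGGER_URL = "http://text0.mib.man.ac.uk/software/geniatagger/a.cgi"


class TextExtractor(HTMLParser):
    """Collects the text chunks of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def tag_text(text, opener=urllib.request.urlopen):
    """POST text as the "paragraph" field and return the response HTML."""
    body = urllib.parse.urlencode({"paragraph": text}).encode("utf-8")
    with opener(TAGGER_URL, body) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_text(html):
    """Return the non-empty text chunks found in html, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

In a GTK or Qt program, the button's click handler would read the text from your input widget, call tag_text on it, and show either the raw HTML or the result of extract_text in an output widget.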

Upvotes: 5
