Slowat_Kela

Reputation: 1511

Urllib2 in Python: why is it returning the webpage formatting and not the actual data?

Can someone tell me why, when I run this code:

import urllib2
for i in range(1,2):
        id_name ='AP' + str("{:05d}".format(i))
        web_page = "http://aps.unmc.edu/AP/database/query_output.php?ID=" + id_name
        page = urllib2.urlopen(web_page)
        html = page.read()
        print html

It returns:

<html>
<head>
<title>detailed information</title>
<style type="text/css">
H1 {font-family:"Time New Roman", Times; font-style:bold; font-size:18pt; color:blue}
H1{text-align:center}
P{font-family:"Time New Roman", Times; font-style:bold; font-size:14pt; line-height:20pt}
P{text-align:justify;margin-left:0px; margin-right:0px;color:blue}
/body{background-image:url('sky.gif')}
/
A:link{color:blue}
A:visited{color:#996666}
</style>
</head>
<H1>Antimicrobial Peptide APAP00001</H1>
<html>
<p style="margin-left: 400px; margin-top: 4; margin-bottom: 0; line-height:100%">
<b>
<a href = "#" onclick = "window.close(self)"><font size="3" color=blue>Close this window
</font> </a>
</b>
</p>
</p>
</body>
</html>

And not the actual data on the page (http://aps.unmc.edu/AP/database/query_output.php?ID=00001) (e.g. net charge, length)?

If I edit this code slightly somehow, is it possible to return all of the information on the page (e.g. the information about net charge, length etc), and not just information about how the page is formatted?

Thanks

Edit 1: Due to Gahan's comment below, I tried this:

import requests
from bs4 import BeautifulSoup

for i in range(8,9):
        webpage = "https://dbaasp.org/peptide-card?type=39&id=" + str(i)
        response = requests.get(webpage)
        soup = BeautifulSoup(response.content, 'html.parser')
        print soup

However, I still seem to get the same result. For example, if I run the Edit 1 code, redirect the output to a file, and then grep for the peptide sequence in that file, it is not there.

Upvotes: 0

Views: 42

Answers (2)

Gahan

Reputation: 4213

Use the requests library:

import requests
from bs4 import BeautifulSoup

url = "http://aps.unmc.edu/AP/database/query_output.php"
for i in range(1, 2):
    id_value = "{:05d}".format(i)
    payload = {"ID": id_value}
    response = requests.get(url, params=payload)
    soup = BeautifulSoup(response.content, 'html.parser')
    table_structure = soup.find('table')
    all_p_tag = table_structure.find_all('p')
    data = {}
    # The <p> tags are assumed to alternate: label, value, label, value, ...
    # Use j here so the outer loop variable i is not overwritten.
    for j in range(0, len(all_p_tag) - 1, 2):
        label = all_p_tag[j].text
        value = all_p_tag[j + 1].text.encode('utf-8').strip()
        data[label] = value
        print("{} {}".format(label, value))
    print(data)

Note: you don't need to wrap "{:05d}".format(i) in str(). format() is string formatting, so it already returns a string.
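
A quick way to check the note above, as a minimal sketch (the variable names are only illustrative):

```python
# str.format() returns a str, so an extra str() call is redundant.
padded = "{:05d}".format(1)
wrapped = str("{:05d}".format(1))

print(type(padded).__name__)  # the type name of the formatted value
print(padded == wrapped)      # whether the str() wrapper changed anything
```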

I have also updated the code to extract the tag details. You don't need grep for this, because BeautifulSoup already provides that facility.
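
The label/value pairing done in the loop above can be sketched without any network access. This assumes, as the answer's code does, that the extracted texts alternate between a label and its value. The helper name pair_labels_with_values and the sample strings are made up for illustration and are not taken from the database.

```python
def pair_labels_with_values(texts):
    """Turn [label, value, label, value, ...] into a dict.

    A trailing label with no value is ignored, because zip() stops
    at the shorter of the two slices.
    """
    labels = texts[0::2]
    values = texts[1::2]
    return {label.strip(): value.strip() for label, value in zip(labels, values)}


# Hypothetical sample data, standing in for the text of the <p> tags.
sample = ["Net charge:", " 2 ", "Length:", " 30 ", "Unpaired label:"]
print(pair_labels_with_values(sample))
```

Slicing plus zip() avoids indexing past the end of the list when the number of tags is odd.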

Upvotes: -1

bruno desthuilliers

Reputation: 77892

In your original snippet, you use "AP00001" as query param:

id_name ='AP' + str("{:05d}".format(i))

so your url is: "http://aps.unmc.edu/AP/database/query_output.php?ID=AP00001", instead of "http://aps.unmc.edu/AP/database/query_output.php?ID=00001"
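
To see the difference concretely, here is a small sketch that builds both URLs. It uses Python 3's standard urllib.parse.urlencode rather than the Python 2 urllib2 from the question, and the helper name build_query_url and its prefix parameter are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://aps.unmc.edu/AP/database/query_output.php"


def build_query_url(number, prefix=""):
    """Build the query URL for a zero-padded, five-digit ID."""
    id_value = prefix + "{:05d}".format(number)
    return BASE_URL + "?" + urlencode({"ID": id_value})


print(build_query_url(1, prefix="AP"))  # the URL the original snippet requested
print(build_query_url(1))               # the same ID without the extra "AP" prefix
```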

A fixed version of your first snippet using requests:

import requests

url = "http://aps.unmc.edu/AP/database/query_output.php"
for i in range(1, 2):
    id_name = "{:05d}".format(i)
    # requests builds the ?ID=... query string from the params dict
    response = requests.get(url, params={"ID": id_name})
    print(response.content)

Upvotes: 2
