Reputation: 1511
I wanted to scrape the table on this page.
I wrote this code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import sys
import requests
import pandas as pd
webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
page = urllib.request.urlopen(webpage)
soup = BeautifulSoup(page,'html.parser')
soup_text = soup.get_text()
print(soup)
The output is an error:
Traceback (most recent call last):
File "scrape_cpad.py", line 9, in <module>
page = urllib.request.urlopen(webpage)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
I've tried this on two different computers and networks. I can also see that the server is up, because I can visit the page in a browser and view its source code.
I also tried changing the URL from https to http, and adding www.
Could someone show me working code that connects to this page and pulls down the table?
P.S. I've seen that there are similar questions, e.g. here and here, but none that answers my question.
Upvotes: 1
Views: 682
Reputation: 195438
Use the requests module to grab the page.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# each row of the peptide table is a <tr> with data-toggle="modal"
for tr in soup.select('tr[data-toggle="modal"]'):
    print(tr.get_text(strip=True, separator=' '))
    print('-' * 120)
Prints:
P-0001 GYE 3 Amyloid Amyloid-beta precursor protein (APP) P05067 No Org Lett. 2008 Jul 3;10(13):2625-8. 18529009 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0002 KFFE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0003 KVVE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0004 NNQQ 4 Amyloid Eukaryotic peptide chain release factor GTP-binding subunit (ERF-3) P05453 Nature. 2007 May 24;447(7143):453-7. 17468747 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0005 VKSE 4 Non-amyloid Microtubule-associated protein tau (PHF-tau) P10636 Proc Natl Acad Sci U S A. 2000 May 9;97(10):5129-34. 10805776 AmyLoad
------------------------------------------------------------------------------------------------------------------------
P-0006 AILSS 5 Amyloid Islet amyloid polypeptide (Amylin) P10997 No Proc Natl Acad Sci U S A. 1990 Jul;87(13):5036-40. 2195544 CPAD
------------------------------------------------------------------------------------------------------------------------
...and so on.
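If you'd rather have the table in a DataFrame (the question already imports pandas), here is a minimal sketch using pandas.read_html. It assumes the data is rendered as a regular <table> element and that the peptide table is the first table on the page; read_html returns a list of DataFrames, one per table found:

import requests
import pandas as pd

url = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'

# parse every <table> in the fetched HTML into a DataFrame;
# index 0 assumes the peptide table is the first one on the page
tables = pd.read_html(requests.get(url).text)
df = tables[0]
print(df.head())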
Upvotes: 1
Reputation: 21275
Seems like the server rejects requests that come without a proper User-Agent header.
I tried setting the User-Agent to my browser's, and the server responded with an HTML page:
import urllib.request

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
req = urllib.request.Request(webpage)

# spoof the UA header so the server doesn't reject the request
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0')

page = urllib.request.urlopen(req)
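From there you can parse the response the same way as in the question. A minimal sketch, borrowing the row selector from the other answer (the data-toggle="modal" selector is an assumption carried over from there):

from bs4 import BeautifulSoup

# the HTTPResponse object is file-like, so BeautifulSoup can read it directly
soup = BeautifulSoup(page, 'html.parser')

# same row selector as in the requests-based answer
for tr in soup.select('tr[data-toggle="modal"]'):
    print(tr.get_text(strip=True, separator=' '))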
Upvotes: 1