Reputation: 335
I have a feeling the answer is somewhere on Stack Overflow, but I can't find it :-/
I'm trying to get the text from this page: https://www.uniprot.org/uniprot/P28653.fasta but my code returns an empty list. Any help is much appreciated!
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE.fasta')
soup = bs(r.content, 'html.parser')
lst = soup.find_all('pre')
print(lst)
returns
[]
Thanks!!
Upvotes: 2
Views: 1187
Reputation: 25241
I think there is a typo or a wrong URL in your approach. Change the URL to http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE and you will get a list with two elements, which you can access in a loop or directly (e.g. lst[1]) to get the sequence.
Code
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE')
soup = bs(r.content, 'html.parser')
lst = soup.find_all('pre')
print(lst)
Output
[<pre>>sp|P28653|PGS1_MOUSE Biglycan OS=Mus musculus OX=10090 GN=Bgn PE=1 SV=1
MCPLWLLTLLLALSQALPFEQKGFWDFTLDDGLLMMNDEEASGSDTTSGVPDLDSVTPTF
SAMCPFGCHCHLRVVQCSDLGLKTVPKEISPDTTLLDLQNNDISELRKDDFKGLQHLYAL
VLVNNKISKIHEKAFSPLRKLQKLYISKNHLVEIPPNLPSSLVELRIHDNRIRKVPKGVF
SGLRNMNCIEMGGNPLENSGFEPGAFDGLKLNYLRISEAKLTGIPKDLPETLNELHLDHN
KIQAIELEDLLRYSKLYRLGLGHNQIRMIENGSLSFLPTLRELHLDNNKLSRVPAGLPDL
KLLQVVYLHSNNITKVGINDFCPMGFGVKRAYYNGISLFNNPVPYWEVQPATFRCVTDRL
AIQFGNYKK
</pre>, <pre class="sequence"> 10 20 30 40 50<br/>MCPLWLLTLL LALSQALPFE QKGFWDFTLD DGLLMMNDEE ASGSDTTSGV <br/> 60 70 80 90 100<br/>PDLDSVTPTF SAMCPFGCHC HLRVVQCSDL GLKTVPKEIS PDTTLLDLQN <br/> 110 120 130 140 150<br/>NDISELRKDD FKGLQHLYAL VLVNNKISKI HEKAFSPLRK LQKLYISKNH <br/> 160 170 180 190 200<br/>LVEIPPNLPS SLVELRIHDN RIRKVPKGVF SGLRNMNCIE MGGNPLENSG <br/> 210 220 230 240 250<br/>FEPGAFDGLK LNYLRISEAK LTGIPKDLPE TLNELHLDHN KIQAIELEDL <br/> 260 270 280 290 300<br/>LRYSKLYRLG LGHNQIRMIE NGSLSFLPTL RELHLDNNKL SRVPAGLPDL <br/> 310 320 330 340 350<br/>KLLQVVYLHS NNITKVGIND FCPMGFGVKR AYYNGISLFN NPVPYWEVQP <br/> 360 <br/>ATFRCVTDRL AIQFGNYKK <br/></pre>]
Upvotes: 0
Reputation: 2619
BeautifulSoup provides a simple way to find text content: .find(text=True)
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE.fasta')
soup = bs(r.content, 'html.parser')
print(soup.find(text=True))
Upvotes: 1
Reputation: 1442
There is no HTML on that page. You can just print r.content directly (though I prefer r.text, as it is a string, not a bytes object), and it will contain the text on the page. Remember, when you use developer tools in Chrome (or other browsers), the HTML you see when you inspect is not necessarily what requests will get. Usually, viewing the page source directly in your browser (or printing the result of requests.get(url).text/.content) will give a more accurate picture of what HTML you are dealing with.
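As a rough illustration of checking what the server actually sent before reaching for an HTML parser, here is a minimal sketch. The helper name looks_like_html is hypothetical (it is not part of requests or BeautifulSoup), and the check is only a heuristic based on the Content-Type header and the first characters of the body.

```python
def looks_like_html(content_type, body):
    """Heuristic: return True if a response seems to be HTML rather than plain text.

    Hypothetical helper for illustration; not part of requests or bs4.
    """
    # The header may carry parameters, e.g. "text/html; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return True
    # Fall back to sniffing the start of the body.
    head = body.lstrip()[:100].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")
```

With the code from the question, this could be called as looks_like_html(r.headers.get("Content-Type", ""), r.text). If it returns False, printing r.text is probably all that is needed.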
Upvotes: 2
Reputation: 956
Like the comment says, the webpage you are looking at is just plain text. You only need BeautifulSoup when you are dealing with HTML.
To get your text, you just need to print the content of your request. It looks like this:
import requests
# .content is the raw response body as bytes; use .text instead for a str
data = requests.get("https://www.uniprot.org/uniprot/P28653.fasta").content
print(data)
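If you want the header and the sequence separately rather than the raw text, something like the following sketch may help. It assumes the response is a single FASTA record (one line starting with ">" followed by sequence lines). The function name parse_fasta is hypothetical, and the sample string is a truncated copy of the output shown in this thread, used only so the example runs without a network call.

```python
def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence).

    Hypothetical helper for illustration; assumes exactly one record.
    """
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA text: first line must start with '>'")
    header = lines[0][1:]            # header line without the leading '>'
    sequence = "".join(lines[1:])    # join wrapped sequence lines into one string
    return header, sequence


# Truncated sample (first and last sequence lines only), for illustration.
sample = """>sp|P28653|PGS1_MOUSE Biglycan OS=Mus musculus OX=10090 GN=Bgn PE=1 SV=1
MCPLWLLTLLLALSQALPFEQKGFWDFTLDDGLLMMNDEEASGSDTTSGVPDLDSVTPTF
AIQFGNYKK
"""

header, sequence = parse_fasta(sample)
print(header)
print(sequence)
```

In practice you would pass the string from the request instead of the sample, e.g. parse_fasta(requests.get(url).text).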
Upvotes: 1