Oliver

Reputation: 335

Python Beautiful Soup html.parser returns none

I have a feeling the information is somewhere on Stack Overflow, but I can't find it :-/

I'm looking to get the text from this website: https://www.uniprot.org/uniprot/P28653.fasta but my code returns 'None.' All help is super appreciated!

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE.fasta')
soup = bs(r.content, 'html.parser')
lst = soup.find_all('pre')
print(lst)

returns

[]

Thanks!!

Upvotes: 2

Views: 1187

Answers (4)

HedgeHog

Reputation: 25241

I think there is a typo or a wrong URL in your approach. Change the URL to http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE and you will get a list with two elements, which you can access in a loop or directly, e.g. lst[1] to get the sequence.

Code

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE')
soup = bs(r.content, 'html.parser')
lst = soup.find_all('pre')

print(lst)

Output

[<pre>&gt;sp|P28653|PGS1_MOUSE Biglycan OS=Mus musculus OX=10090 GN=Bgn PE=1 SV=1
MCPLWLLTLLLALSQALPFEQKGFWDFTLDDGLLMMNDEEASGSDTTSGVPDLDSVTPTF
SAMCPFGCHCHLRVVQCSDLGLKTVPKEISPDTTLLDLQNNDISELRKDDFKGLQHLYAL
VLVNNKISKIHEKAFSPLRKLQKLYISKNHLVEIPPNLPSSLVELRIHDNRIRKVPKGVF
SGLRNMNCIEMGGNPLENSGFEPGAFDGLKLNYLRISEAKLTGIPKDLPETLNELHLDHN
KIQAIELEDLLRYSKLYRLGLGHNQIRMIENGSLSFLPTLRELHLDNNKLSRVPAGLPDL
KLLQVVYLHSNNITKVGINDFCPMGFGVKRAYYNGISLFNNPVPYWEVQPATFRCVTDRL
AIQFGNYKK
</pre>, <pre class="sequence">        10         20         30         40         50<br/>MCPLWLLTLL LALSQALPFE QKGFWDFTLD DGLLMMNDEE ASGSDTTSGV <br/>        60         70         80         90        100<br/>PDLDSVTPTF SAMCPFGCHC HLRVVQCSDL GLKTVPKEIS PDTTLLDLQN <br/>       110        120        130        140        150<br/>NDISELRKDD FKGLQHLYAL VLVNNKISKI HEKAFSPLRK LQKLYISKNH <br/>       160        170        180        190        200<br/>LVEIPPNLPS SLVELRIHDN RIRKVPKGVF SGLRNMNCIE MGGNPLENSG <br/>       210        220        230        240        250<br/>FEPGAFDGLK LNYLRISEAK LTGIPKDLPE TLNELHLDHN KIQAIELEDL <br/>       260        270        280        290        300<br/>LRYSKLYRLG LGHNQIRMIE NGSLSFLPTL RELHLDNNKL SRVPAGLPDL <br/>       310        320        330        340        350<br/>KLLQVVYLHS NNITKVGIND FCPMGFGVKR AYYNGISLFN NPVPYWEVQP <br/>       360 <br/>ATFRCVTDRL AIQFGNYKK                                   <br/></pre>]
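The second element in the output above mixes position numbers and spaces into the sequence. A minimal sketch of turning that block into one continuous string, assuming you have already pulled its text out (for example with lst[1].get_text()); the helper name clean_sequence_block is made up for illustration and is not part of BeautifulSoup:

```python
import re


def clean_sequence_block(block_text):
    """Drop position numbers and whitespace, keeping only the residue letters.

    Hypothetical helper: assumes the sequence uses uppercase one-letter codes,
    as in the output shown above.
    """
    return "".join(re.findall(r"[A-Z]+", block_text))


# First two lines of the numbered block from the output above.
block = (
    "        10         20         30         40         50\n"
    "MCPLWLLTLL LALSQALPFE QKGFWDFTLD DGLLMMNDEE ASGSDTTSGV \n"
)
print(clean_sequence_block(block))
```

The first element of the list already holds the sequence without numbering, so this cleanup is only needed if you work from the numbered block.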

Upvotes: 0

Samsul Islam

Reputation: 2619

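BeautifulSoup provides a simple way to get the text content: .find(text=True). Since the response contains no tags, the whole body comes back as a single text node. (In Beautiful Soup 4.4.0 and later the argument is also available as string=True.)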

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.uniprot.org/uniprot/P28653_PGS1_MOUSE.fasta')
soup = bs(r.content, 'html.parser')

print(soup.find(text=True))

Upvotes: 1

goalie1998

Reputation: 1442

There is no HTML on the site. You can just print r.content directly (however, I prefer r.text, as it is a string rather than a bytes object), and it will contain the text on the page. Remember, when you use developer tools in Chrome (or other browsers), the HTML you see when you inspect is not necessarily the same as what requests will get. Usually, viewing the page source directly in your browser (or printing the result of requests.get(url).text/.content) gives a more accurate picture of the HTML you are dealing with.
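Once you have the plain text, you may want the header and the sequence separately. Here is a rough sketch of a small FASTA splitter using only the standard library; parse_fasta is a made-up name for illustration, and the sample string is a shortened copy of the record shown in this thread rather than a live download:

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without newlines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened sample: the header plus the first and last sequence lines.
sample = (
    ">sp|P28653|PGS1_MOUSE Biglycan OS=Mus musculus OX=10090 GN=Bgn PE=1 SV=1\n"
    "MCPLWLLTLLLALSQALPFEQKGFWDFTLDDGLLMMNDEEASGSDTTSGVPDLDSVTPTF\n"
    "AIQFGNYKK\n"
)
for header, sequence in parse_fasta(sample):
    print(header.split()[0], len(sequence))
```

With the code from the question, this would be called as parse_fasta(r.text).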

Upvotes: 2

Akilan Manivannan

Reputation: 956

Like the comment says, the webpage you are looking at is just plain text. BeautifulSoup is only useful when you are dealing with HTML (or XML) markup.

To get your text, you just need to print the content of your request. It looks like this:

import requests

data = requests.get("https://www.uniprot.org/uniprot/P28653.fasta").text  # .text is a str; .content would be bytes
print(data)

Upvotes: 1
