Reputation: 352
I am trying to get the UniProt ID from this webpage: ENSEMBL. But I am having trouble using XPath. Right now I am getting an empty list and I do not understand why.
My idea is to write a small function that takes the ENSEMBL IDs and returns the uniprot ID.
import requests
from lxml import html
ens_code = 'ENST00000378404'
webpage = 'http://www.ensembl.org/id/'+ens_code
response = requests.get(webpage)
tree = html.fromstring(response.content)
path = '//*[@id="ensembl_panel_1"]/div[2]/div[3]/div[3]/div[2]/p/a'
uniprot_id = tree.xpath(path)
print(uniprot_id)
Any help would be appreciated :)
Now it only prints the lists that are found, but it still returns None for the other IDs.
def getUniprot(ensembl_code):
    ensembl_code = ensembl_code[:-1]
    webpage = 'http://www.ensembl.org/id/'+ensembl_code
    response = requests.get(webpage)
    tree = html.fromstring(response.content)
    path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
    uniprot_id = tree.xpath(path)
    if uniprot_id:
        print(uniprot_id)
        return uniprot_id
Upvotes: 2
Views: 773
Reputation: 180481
You are getting an empty list because it looks like you used the XPath that Chrome supplies when you right-click and choose Copy XPath. That XPath returns nothing because the element is not in the source: it is dynamically generated, so what requests returns does not contain it.
In [6]: response = requests.get(webpage)
In [7]: "ensembl_panel_1" in response.content
Out[7]: False
You should always check the page source to see what you are actually getting back; what you see in the developer console is not necessarily what you get when you download the source.
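As a rough sketch of that check, the helper below tests whether a marker string occurs in the downloaded source. The name source_contains is made up for illustration, and the sample bytes are a stand-in for response.content, not the real Ensembl page. Decoding first matters on Python 3, where response.content is bytes and a plain `in` test against a str raises a TypeError.

```python
def source_contains(content, marker, encoding="utf-8"):
    """Return True if marker occurs in the raw page source.

    content may be bytes (like response.content) or text (like response.text).
    """
    if isinstance(content, bytes):
        # Decode leniently so stray bytes do not abort the check.
        content = content.decode(encoding, errors="replace")
    return marker in content


# Hypothetical stand-in for response.content; the real page will differ.
sample = b'<html><body><div class="lhs">Uniprot</div></body></html>'

print(source_contains(sample, "ensembl_panel_1"))
print(source_contains(sample, 'class="lhs"'))
```

If the id you copied from the developer tools is missing here, an XPath built on it cannot match the parsed tree either.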
You can also use a more specific XPath in case there are other http://www.uniprot.org/uniprot/ links on the page: search the divs for one with the class "lhs" and the text Uniprot, then get the text from the first following anchor tag:
path = '//div[@class="lhs" and text()="Uniprot"]/following::a[1]/text()'
Which would give you:
['Q8TDY3']
You can also select the following sibling div, where the anchor is inside its child p tag:
path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
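Here is a minimal sketch of how both expressions could be wrapped in a function that works on an HTML string, so it can be tried without hitting the site. The function name extract_uniprot_ids and the sample markup are assumptions for illustration (in particular the "rhs" class and the overall layout are guesses, not copied from the Ensembl page). It always returns a list, which is empty when nothing matches, so the caller never receives None.

```python
from lxml import html

SIBLING_PATH = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
FOLLOWING_PATH = '//div[@class="lhs" and text()="Uniprot"]/following::a[1]/text()'


def extract_uniprot_ids(page_source, path=SIBLING_PATH):
    """Return the link texts matched by path as a list of plain strings.

    An empty list means the page source did not contain a match.
    """
    tree = html.fromstring(page_source)
    return [str(text) for text in tree.xpath(path)]


# Hypothetical markup that only mimics the structure the XPaths expect.
SAMPLE = (
    '<html><body><div class="row">'
    '<div class="lhs">Uniprot</div>'
    '<div class="rhs"><p>'
    '<a href="http://www.uniprot.org/uniprot/Q8TDY3">Q8TDY3</a>'
    '</p></div>'
    '</div></body></html>'
)

print(extract_uniprot_ids(SAMPLE))
print(extract_uniprot_ids(SAMPLE, FOLLOWING_PATH))
print(extract_uniprot_ids('<html><body><p>nothing here</p></body></html>'))
```

With a real page you would pass response.content from requests instead of SAMPLE.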
Upvotes: 3