Reputation: 17
I'm trying to extract specific data from a table on this website: http://bioinfo.life.hust.edu.cn/lncRNASNP/#!/lnc_disease. I'm using Python and Selenium.
My problem is that when I parse the page with pandas' read_html(), only the table headers are found and every cell reads "No items". This is the output I get:
lncRNA ID Chromosome Disease Pubmed P-value Bonferroni Variant miRNA Gain Loss
0 No items No items No items No items No items No items No items No items No items No items
And this is the code I used:
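from selenium import webdriver
import pandas as pd
link = "http://bioinfo.life.hust.edu.cn/lncRNASNP/#!/lnc_disease"
driver = webdriver.Chrome()
driver.get(link)
df = pd.read_html(driver.page_source)[0]
print(df.head())
driver.close()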
Also, if I fetch the page source so I can extract the table directly from it (using the BeautifulSoup and requests libraries), I get completely different HTML from what I see when I inspect an element in Chrome, and the program reports "No tables found".
Am I doing something wrong or is the table just inaccessible via these methods?
Upvotes: 1
Views: 142
Reputation: 20052
There's a backend endpoint that serves the entire table as JSON, so why not just grab that?
Here's how:
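import pandas as pd
import requests
from tabulate import tabulate
# JSON endpoint that backs the table on the page
disease_list_url = "http://bioinfo.life.hust.edu.cn/lncRNASNP/api/exp_disease_list"
# Fetch the first page and parse the JSON body
response = requests.get(disease_list_url).json()
# The table rows sit under the "lncrna_gene_list" key
df = pd.DataFrame(response["lncrna_gene_list"])
print(tabulate(df))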
Output:
-- - -------- ----- --------------- -- -- --- --- -- --- -- -- --- ------------------------- -- -- --- - --
0 0 21961160 chr1 NONHSAT001955.2 45 3 347 526 3 602 49 32 94 Alzheimer's disease 14 0 80 0 2
1 0 20124551 chr1 NONHSAT007671.2 4 0 87 79 2 82 0 9 22 autoimmune disease 3 0 19 0 0
2 0 18982067 chr1 NONHSAT009623.2 0 22 62 92 15 101 0 0 72 AIDS 11 0 61 0 17
3 0 15478311 chr1 NONHSAT010193.2 35 0 227 325 0 326 44 26 173 affective disorders 41 0 132 0 0
4 0 23791884 chr1 NONHSAT010193.2 35 0 227 325 0 326 44 26 173 Autism spectrum disorder 41 0 132 0 0
5 0 19606485 chr1 NONHSAT010193.2 35 0 227 325 0 326 44 26 173 Autism spectrum disorder 41 0 132 0 0
6 0 22817756 chr1 NONHSAT010193.2 35 0 227 325 0 326 44 26 173 Autism spectrum disorder 41 0 132 0 0
7 0 22019903 chr11 NONHSAT017462.2 38 0 400 770 0 806 37 23 119 adrenocortical carcinomas 29 11 90 0 0
8 0 21954592 chr11 NONHSAT017462.2 38 0 400 770 0 806 37 23 119 atherosclerosis 29 11 90 0 0
9 0 22067257 chr11 NONHSAT017531.2 10 0 69 126 0 168 11 3 94 aging 15 0 79 0 0
10 0 10340388 chr11 NONHSAT018662.2 18 0 81 140 0 130 13 11 84 acute myeloid leukemia 10 0 74 0 0
11 0 17940140 chr11 NONHSAT018662.2 18 0 81 140 0 130 13 11 84 acute myeloid leukemia 10 0 74 0 0
12 0 18587408 chr11 NONHSAT024403.2 0 7 73 113 5 170 0 0 49 Alzheimer's disease 10 0 39 0 9
13 0 21785702 chr11 NONHSAT024403.2 0 7 73 113 5 170 0 0 49 Alzheimer's disease 10 0 39 0 9
14 0 22817756 chr11 NONHSAT024403.2 0 7 73 113 5 170 0 0 49 Alzheimer's disease 10 0 39 0 9
-- - -------- ----- --------------- -- -- --- --- -- --- -- -- --- ------------------------- -- -- --- - --
And if you want the second page, just add a page parameter to the URL:
disease_list_url = "http://bioinfo.life.hust.edu.cn/lncRNASNP/api/exp_disease_list?page=2"
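If you need several pages at once, the same page parameter can be looped over. The following is only a rough sketch: the helper names fetch_json and collect_rows are made up for illustration, and it assumes that page=1 returns the same rows as the URL without a parameter and that a page past the end comes back with an empty "lncrna_gene_list". Neither assumption is confirmed here, so check both against the real responses. It uses only the standard library for the download.

```python
import json
from urllib.request import urlopen

# Endpoint and the "lncrna_gene_list" key are the ones used in this answer.
BASE_URL = "http://bioinfo.life.hust.edu.cn/lncRNASNP/api/exp_disease_list"


def fetch_json(url):
    """Download a URL and decode its body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def collect_rows(pages, get_json=fetch_json):
    """Gather the rows of several pages into one list (hypothetical helper).

    Stops at the first page that has no rows. That an out-of-range page
    returns an empty list is an assumption about the API.
    """
    rows = []
    for page in pages:
        payload = get_json(f"{BASE_URL}?page={page}")
        batch = payload.get("lncrna_gene_list", [])
        if not batch:
            break
        rows.extend(batch)
    return rows
```

Calling collect_rows(range(1, 4)) would then try pages 1 to 3, and the resulting list can be passed to pd.DataFrame() the same way as a single page. The get_json argument is there so the loop can be tried with a stub before hitting the server.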
Upvotes: 1