Web scraping a hidden table using Python

I am trying to scrape the "Traits" table from https://www.ebi.ac.uk/gwas/genes/SAMD12 (the gene in the URL will change depending on what I need, but the page structure stays the same).

The problem is that my knowledge of web scraping is quite limited, and I can't get this table using the basic BeautifulSoup workflow I've seen so far.

Here's my code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')  # parse the static HTML returned by the server

I'm looking for the "efotrait-table":

efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())
<div class="row" id="efotrait-table-loading" style="margin-top:20px">
 <div class="panel panel-default" id="efotrait_panel">
  <div class="panel-heading background-color-primary-accent">
   <h3 class="panel-title">
    <span class="efotrait_label">
     Traits
    </span>
    <span class="efotrait_count badge available-data-btn-badge">
    </span>
   </h3>
   <span class="pull-right">
    <span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
     <span class="glyphicon glyphicon-chevron-up">
     </span>
    </span>
   </span>
  </div>
  <div class="panel-body">
   <table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
   </table>
  </div>
 </div>
</div>

Specifically, this one:

soup.select('table#efotrait-table')[0]
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>

As you can see, the table's content doesn't show up. On the website there's an option to save the table as CSV, and it would be great if I could get that download link somehow. But when I click the link to copy it, I get "javascript:void(0)" instead. I haven't studied JavaScript; should I?

The table is hidden, and even if it weren't, I would need to interactively select more rows per page to get the whole table (and the URL doesn't change when I do, so I can't get the full table that way either).

I would like to know a way to access this table programmatically (even as unstructured data); organizing it afterwards will be the easy part. Any clues on how to do that (or what I should study) will be greatly appreciated.

Thanks in advance

Upvotes: 0

Views: 1042

Answers (1)

The data you want is available from an API call, so you can request it directly:

import requests

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data).json()
    print(r)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
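Since the question mentions that the gene in the URL can change, here is a minimal sketch of building the payload for an arbitrary gene symbol. The helper name build_payload and the trimmed template are hypothetical and only for illustration; it assumes the endpoint accepts other gene symbols in the same "q" format as SAMD12, which I have not checked for every gene.

```python
def build_payload(gene, base_payload):
    """Return a copy of base_payload whose "q" entry targets the given gene symbol."""
    payload = dict(base_payload)  # shallow copy so the template is not modified
    payload["q"] = (
        f'ensemblMappedGenes: "{gene}" OR '
        f'association_ensemblMappedGenes: "{gene}"'
    )
    return payload


# Trimmed template for illustration only (assumption): in practice,
# pass the full `data` dict from the snippet above, including "fl" and "raw".
template = {
    "q": "",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
}

payload = build_payload("SAMD12", template)
print(payload["q"])
```

The result can then be sent with requests.post(url, data=payload), exactly as in main() above.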

You can inspect r.keys() and pull out the data you need by indexing into the dict.
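If the nesting is not obvious from r.keys() alone, a small helper can outline the structure for you. This is only a sketch: describe is a hypothetical helper, and the sample dict below is made up to keep the example runnable offline. It is not the real layout of the API response.

```python
def describe(obj, depth=0, max_depth=3):
    """Return lines outlining the keys of nested dicts and the lengths of lists."""
    lines = []
    indent = "  " * depth
    if depth > max_depth:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Only the first element is outlined, assuming list items share a shape
        lines.append(f"{indent}[list of {len(obj)}, showing first item]")
        lines.extend(describe(obj[0], depth + 1, max_depth))
    return lines


# Made-up stand-in for the parsed JSON (assumption, not the real response)
sample = {
    "responseHeader": {"status": 0},
    "results": [{"traitName_s": "Example trait", "mappedLabel": ["example label"]}],
}

for line in describe(sample):
    print(line)
```

With the real response you would call describe(r) on the dict returned by .json() and then index down to the part that holds the records.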

But here's a quick way to load it (lazy code):

import requests
import re
import pandas as pd

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
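    r = requests.post(url, data=data)
    # Lazy extraction: run a regex over the raw JSON text instead of parsing it.
    # Each tuple is (mappedLabel, traitName_s); the set drops duplicate pairs.
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    # match is a set of tuples, not a dict, so build the frame from a list of rows
    df = pd.DataFrame(list(match))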
    print(df)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

Output:

0              heel bone mineral density                          Heel bone mineral density
1              interleukin-8 measurement  Chronic obstructive pulmonary disease-related ...
2   self reported educational attainment        Educational attainment (years of education)
3                        waist-hip ratio                                    Waist-hip ratio
4             eye morphology measurement                                     Eye morphology
5                       CC16 measurement  Chronic obstructive pulmonary disease-related ...
6         age-related hearing impairment  Age-related hearing impairment (SNP x SNP inte...
7    eosinophil percentage of leukocytes               Eosinophil percentage of white cells
8          coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
9                     multiple sclerosis                                 Multiple sclerosis
10                  mathematical ability                    Highest math class taken (MTAG)
11                 risk-taking behaviour                      General risk tolerance (MTAG)
12         coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
13  self reported educational attainment                      Educational attainment (MTAG)
14                          pancreatitis                                       Pancreatitis
15               hair colour measurement                                         Hair color
16                      breast carcinoma  Breast cancer specific mortality in breast cancer
17                      eosinophil count                                  Eosinophil counts
18                     self rated health                                  Self-rated health
19                          bone density                               Bone mineral density
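The regex above works on the raw text, which can break if the field order changes. A slightly more robust sketch is to walk the parsed JSON and collect every record that carries both fields, without needing to know the exact nesting. The function name collect_records is hypothetical, and the sample dict is made up so the example runs without a network call; it does not reproduce the real response layout.

```python
def collect_records(obj, keys):
    """Recursively gather every dict that contains all of the given keys."""
    found = []
    if isinstance(obj, dict):
        if all(k in obj for k in keys):
            found.append({k: obj[k] for k in keys})
        for value in obj.values():
            found.extend(collect_records(value, keys))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(collect_records(item, keys))
    return found


# Made-up stand-in for the parsed JSON (assumption, not the real response)
sample = {
    "grouped": {
        "docs": [
            {"traitName_s": "Trait A", "mappedLabel": ["label a"], "id": "1"},
            {"title": "a record without trait fields"},
            {"traitName_s": "Trait B", "mappedLabel": ["label b"], "id": "2"},
        ]
    }
}

rows = collect_records(sample, ("mappedLabel", "traitName_s"))
for row in rows:
    print(row)
```

With the real data you would pass requests.post(url, data=data).json() instead of sample, and pd.DataFrame(rows) then gives a table with one named column per field.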

Upvotes: 3
