Reputation: 4929
I am trying to scrape the "Traits" table from https://www.ebi.ac.uk/gwas/genes/SAMD12 (the URL will change depending on the gene I need, but the page structure stays the same).
The problem is that my web-scraping knowledge is quite limited, and I can't get this table with the basic BeautifulSoup workflow I've seen so far.
Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)
# Parse the static HTML returned by the server
soup = BeautifulSoup(page.content, 'html.parser')
I'm looking for the "efotrait-table":
efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())
<div class="row" id="efotrait-table-loading" style="margin-top:20px">
<div class="panel panel-default" id="efotrait_panel">
<div class="panel-heading background-color-primary-accent">
<h3 class="panel-title">
<span class="efotrait_label">
Traits
</span>
<span class="efotrait_count badge available-data-btn-badge">
</span>
</h3>
<span class="pull-right">
<span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
<span class="glyphicon glyphicon-chevron-up">
</span>
</span>
</span>
</div>
<div class="panel-body">
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
</div>
</div>
</div>
Specifically, this one:
soup.select('table#efotrait-table')[0]
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
As you can see, the table's content doesn't show up. On the website there's an option to save the table as CSV, and it would be great if I could get that download link somehow. But when I try to copy the link, I get "javascript:void(0)" instead. I haven't studied JavaScript; should I?
The table is hidden, and even if it weren't, I would need to interactively select more rows per page to see the whole table (the URL doesn't change when I do, so I can't fetch the full table that way either).
I would like to know how to access this table programmatically, even as unstructured data; tidying it up afterwards won't be a problem. Any pointers on how to do that (or on what I should study) would be greatly appreciated.
Thanks in advance
Upvotes: 0
Views: 1042
Reputation: 11525
The desired data is available through an API call.
import requests
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
def main(url):
    r = requests.post(url, data=data).json()
    print(r)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
You can inspect r.keys() and load the data you want by accessing the dict.
But here's a quick way to load it (lazy code):
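One way to do that without knowing the exact nesting in advance is to walk the parsed JSON recursively and keep every dict that carries the fields of interest. This is only a sketch: the helper iter_records and the sample payload are made up for illustration, and the real response layout may differ. The field names traitName_s and mappedLabel are taken from the request's "fl" list.

```python
from typing import Any, Iterator


def iter_records(node: Any, required: tuple) -> Iterator[dict]:
    """Yield every dict nested anywhere in `node` that has all `required` keys."""
    if isinstance(node, dict):
        if all(key in node for key in required):
            yield node
        for value in node.values():
            yield from iter_records(value, required)
    elif isinstance(node, list):
        for item in node:
            yield from iter_records(item, required)


# Hypothetical payload standing in for requests.post(...).json();
# the real response layout may differ.
sample = {
    "grouped": {
        "resourcename": {
            "groups": [
                {"doclist": {"docs": [
                    {"traitName_s": "Hair color",
                     "mappedLabel": ["hair colour measurement"]},
                    {"traitName_s": "Pancreatitis",
                     "mappedLabel": ["pancreatitis"]},
                    {"rsId": ["rs000"]},  # lacks the wanted keys, so it is skipped
                ]}}
            ]
        }
    }
}

# Build (mappedLabel, traitName_s) pairs; mappedLabel is assumed to be a list.
rows = [(doc["mappedLabel"][0], doc["traitName_s"])
        for doc in iter_records(sample, ("traitName_s", "mappedLabel"))]
print(rows)
```

With the real response you would pass requests.post(url, data=data).json() in place of sample.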
import requests
import re
import pandas as pd
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
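def main(url):
    r = requests.post(url, data=data)
    # Collect unique (mappedLabel, traitName_s) pairs from the raw response text.
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    # Recent pandas versions reject sets as unordered, so pass a list.
    df = pd.DataFrame(list(match))
    print(df)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")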
Output:
0   heel bone mineral density             Heel bone mineral density
1   interleukin-8 measurement             Chronic obstructive pulmonary disease-related ...
2   self reported educational attainment  Educational attainment (years of education)
3   waist-hip ratio                       Waist-hip ratio
4   eye morphology measurement            Eye morphology
5   CC16 measurement                      Chronic obstructive pulmonary disease-related ...
6   age-related hearing impairment        Age-related hearing impairment (SNP x SNP inte...
7   eosinophil percentage of leukocytes   Eosinophil percentage of white cells
8   coronary artery calcification         Coronary artery calcified atherosclerotic plaq...
9   multiple sclerosis                    Multiple sclerosis
10  mathematical ability                  Highest math class taken (MTAG)
11  risk-taking behaviour                 General risk tolerance (MTAG)
12  coronary artery calcification         Coronary artery calcified atherosclerotic plaq...
13  self reported educational attainment  Educational attainment (MTAG)
14  pancreatitis                          Pancreatitis
15  hair colour measurement               Hair color
16  breast carcinoma                      Breast cancer specific mortality in breast cancer
17  eosinophil count                      Eosinophil counts
18  self rated health                     Self-rated health
19  bone density                          Bone mineral density
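Since the question mentions that the gene will change, the gene symbol can be factored out of the payload, because it only appears in the "q" field. The sketch below assumes the endpoint accepts other gene symbols in the same query format, which I have not verified. The helper names build_query and with_gene are hypothetical, and "BRCA2" is just an arbitrary example symbol.

```python
def build_query(gene: str) -> str:
    """Return the "q" string for one gene symbol, mirroring the SAMD12 query."""
    return (f'ensemblMappedGenes: "{gene}" '
            f'OR association_ensemblMappedGenes: "{gene}"')


def with_gene(payload: dict, gene: str) -> dict:
    """Return a copy of `payload` in which only the "q" entry is replaced."""
    updated = dict(payload)
    updated["q"] = build_query(gene)
    return updated


# Shortened stand-in for the full `data` dict used above.
base = {"q": build_query("SAMD12"), "max": "99999"}
other = with_gene(base, "BRCA2")
print(other["q"])
```

You would then post with_gene(data, gene) instead of data for each gene of interest; the original dict is left untouched, so it can be reused across requests.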
Upvotes: 3