Reputation: 9752
This is a difficult question to put into one sentence, but I am trying to scrape some HTML from the following page
http://www.ncbi.nlm.nih.gov/snp/?term=(human[Organism])+AND+GLRA3[Gene Name]
where I can scrape what I need using R, but because the browser only displays the first 20 entries, only the HTML for those entries is available to me. This is a problem because I want to scrape all entries, not just the ones the page serves to the browser. Anyway, here is my R code:
library(XML)
library(httr)
#Go to Nectar Mutation and get SNP refs
dbsnp.searchterm="(human[Organism])+AND+GLRA1[Gene Name]"
dbsnp.url=paste0("http://www.ncbi.nlm.nih.gov/snp/?term=",dbsnp.searchterm)
dbsnp.get=GET(dbsnp.url)
dbsnp.content=content(dbsnp.get, as="text")
links<-xpathSApply(htmlParse(dbsnp.content), "//a[contains(@href, 'snp_ref')]",xmlGetAttr,"href")
and the result:
> links
[1] "/projects/SNP/snp_ref.cgi?rs=116474260"
[2] "/projects/SNP/snp_ref.cgi?rs=121918408"
[3] "/projects/SNP/snp_ref.cgi?rs=121918409"
[4] "/projects/SNP/snp_ref.cgi?rs=121918410"
[5] "/projects/SNP/snp_ref.cgi?rs=121918411"
[6] "/projects/SNP/snp_ref.cgi?rs=121918412"
[7] "/projects/SNP/snp_ref.cgi?rs=121918413"
[8] "/projects/SNP/snp_ref.cgi?rs=121918414"
[9] "/projects/SNP/snp_ref.cgi?rs=121918415"
[10] "/projects/SNP/snp_ref.cgi?rs=121918416"
[11] "/projects/SNP/snp_ref.cgi?rs=121918417"
[12] "/projects/SNP/snp_ref.cgi?rs=121918418"
[13] "/projects/SNP/snp_ref.cgi?rs=267600494"
[14] "/projects/SNP/snp_ref.cgi?rs=267606848"
[15] "/projects/SNP/snp_ref.cgi?rs=281864912"
[16] "/projects/SNP/snp_ref.cgi?rs=281864913"
[17] "/projects/SNP/snp_ref.cgi?rs=281864914"
[18] "/projects/SNP/snp_ref.cgi?rs=281864915"
[19] "/projects/SNP/snp_ref.cgi?rs=281864916"
[20] "/projects/SNP/snp_ref.cgi?rs=281864917"
Note that the search returns 4058 entries, and I need all of them.
Upvotes: 3
Views: 208
Reputation: 1568
You will want to use the API that @Roost found. I will add that httr has a built-in method for adding query parameters, and you should use it because it will automatically URL-encode your query parameters for you.
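To see what that encoding looks like without sending a request, here is a minimal sketch using httr's modify_url(), which builds a URL from a base and a query list. The names base_url, search_term and encoded_url are illustrative, and the exact escape sequences depend on httr and curl, so I only print the result rather than promise a specific string. As far as I know, GET(url, query = ...) builds its URL along the same path.

```r
library(httr)

# Illustrative names; the search term is the one from the question.
base_url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
search_term <- "(human[Organism]) AND GLRA3[Gene Name]"

# modify_url() appends the query list to the URL and percent-encodes the
# values, so the spaces and brackets in the term do not need manual escaping.
encoded_url <- modify_url(base_url, query = list(db = "SNP", term = search_term))

print(encoded_url)
```

This is why the search term in the code further down can be written with plain spaces instead of "+" signs.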
In the XML package, using xmlToList will be easier if you are not comfortable with XPath, but you can parse the XML whichever way you prefer.
library(XML)
library(httr)
# Go to api and get Count
dbsnp.searchterm <- "(human[Organism]) AND GLRA3[Gene Name]"
dbsnp.url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
dbsnp.get <- GET(dbsnp.url, query=list(db="SNP", term=dbsnp.searchterm))
dbsnp.content <- content(dbsnp.get, as="text")
dbsnp.xml <- xmlParse(dbsnp.content)
max_count <- xmlToList(dbsnp.xml)$Count
# Use the Count to form the query that you want
dbsnp.full.get <- GET(dbsnp.url, query=list(
  db="SNP",
  term=dbsnp.searchterm,
  RetMax=max_count))
dbsnp.full.content <- content(dbsnp.full.get, as="text")
dbsnp.full.xml <- xmlParse(dbsnp.full.content)
dbsnp.full.list <- xmlToList(dbsnp.full.xml)
prefix <- "/projects/SNP/snp_ref.cgi?rs="
dbsnp.links <- paste0(prefix, unlist(dbsnp.full.list$IdList))
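If you would rather use XPath than xmlToList, the same values can be pulled out with xpathSApply. The sketch below runs on a small hand-written XML string that only imitates the shape of the esearch response (a Count, a RetMax, and an IdList of Id elements, using the first three rs numbers from the question), so it works offline. The real response has more fields, and sample_xml is a stand-in for dbsnp.full.content above.

```r
library(XML)

# Hand-written stand-in for the esearch response; not real API output.
sample_xml <- "<eSearchResult>
  <Count>3</Count>
  <RetMax>3</RetMax>
  <IdList>
    <Id>116474260</Id>
    <Id>121918408</Id>
    <Id>121918409</Id>
  </IdList>
</eSearchResult>"

doc <- xmlParse(sample_xml, asText = TRUE)

# Total number of hits; the absolute path selects only the top-level Count
total_count <- as.numeric(xpathSApply(doc, "/eSearchResult/Count", xmlValue))

# All rs IDs listed under IdList, as a character vector
ids <- xpathSApply(doc, "//IdList/Id", xmlValue)

# Rebuild the link format used on the dbSNP results page
prefix <- "/projects/SNP/snp_ref.cgi?rs="
snp_links <- paste0(prefix, ids)
print(snp_links)
```

The absolute path for Count is a deliberate choice: a looser expression such as "//Count" could match more than one element if the response nests further Count elements elsewhere.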
Upvotes: 1
Reputation: 86
It took me the entire afternoon and I still only have half of the solution (this is my first time working with XML). Anyway, I figured out that you can use the following link to get the results in XML format:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=SNP&term=(human[Organism])+AND+GLRA3[Gene+Name]
Here db stands for the database you want to search in, and term is fairly self-explanatory.
At the top of the results you will then see:
<Count>4736</Count>
<RetMax>20</RetMax>
Below this, the ID list starts and shows 20 IDs, which are equivalent to the rs value in, for example:
/projects/SNP/snp_ref.cgi?rs=116474260
You can use the GET function to get this information in R. The remaining step is to have R read the number in the Count row (the total number of results available) and then call GET again with &RetMax=X appended to the link, where X is the number in the Count row.
For example:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=SNP&term=(human[Organism])+AND+GLRA3[Gene+Name]&RetMax=4736
Now all the IDs are imported into R (again, I lack the skills to extract them nicely from the data, so that might be for someone else to figure out).
Hope this helps!
Upvotes: 1