brucezepplin

Reputation: 9752

R scraping entire html, not just page view

This is a difficult question to put into one sentence, but I am trying to scrape some HTML from the following page:

http://www.ncbi.nlm.nih.gov/snp/?term=(human[Organism])+AND+GLRA3[Gene Name]

I can scrape what I need using R, but because the browser only displays the first 20 entries, only the HTML for those 20 entries is available to me. This is a problem because I want to scrape all entries, not just those served on the first page of results. Anyway, here is my R code:

library(XML)
library(httr)

#Go to Nectar Mutation and get SNP refs
dbsnp.searchterm="(human[Organism])+AND+GLRA1[Gene Name]"
dbsnp.url=paste0("http://www.ncbi.nlm.nih.gov/snp/?term=",dbsnp.searchterm)
dbsnp.get=GET(dbsnp.url)
dbsnp.content=content(dbsnp.get, as="text")
links<-xpathSApply(htmlParse(dbsnp.content), "//a[contains(@href, 'snp_ref')]",xmlGetAttr,"href")

and the result:

> links
 [1] "/projects/SNP/snp_ref.cgi?rs=116474260"
 [2] "/projects/SNP/snp_ref.cgi?rs=121918408"
 [3] "/projects/SNP/snp_ref.cgi?rs=121918409"
 [4] "/projects/SNP/snp_ref.cgi?rs=121918410"
 [5] "/projects/SNP/snp_ref.cgi?rs=121918411"
 [6] "/projects/SNP/snp_ref.cgi?rs=121918412"
 [7] "/projects/SNP/snp_ref.cgi?rs=121918413"
 [8] "/projects/SNP/snp_ref.cgi?rs=121918414"
 [9] "/projects/SNP/snp_ref.cgi?rs=121918415"
[10] "/projects/SNP/snp_ref.cgi?rs=121918416"
[11] "/projects/SNP/snp_ref.cgi?rs=121918417"
[12] "/projects/SNP/snp_ref.cgi?rs=121918418"
[13] "/projects/SNP/snp_ref.cgi?rs=267600494"
[14] "/projects/SNP/snp_ref.cgi?rs=267606848"
[15] "/projects/SNP/snp_ref.cgi?rs=281864912"
[16] "/projects/SNP/snp_ref.cgi?rs=281864913"
[17] "/projects/SNP/snp_ref.cgi?rs=281864914"
[18] "/projects/SNP/snp_ref.cgi?rs=281864915"
[19] "/projects/SNP/snp_ref.cgi?rs=281864916"
[20] "/projects/SNP/snp_ref.cgi?rs=281864917"

Only 20 links come back, but the search has 4058 entries, and I need all of them.

Upvotes: 3

Views: 208

Answers (2)

waternova

Reputation: 1568

You will want to use the API that @Joost found. I will add that httr has a built-in way to add query parameters (the query argument of GET), and you should use it because it will automatically URL-encode your query parameters for you.

With the XML package, xmlToList will be easier if you are not comfortable with XPath, but you can parse the XML whichever way you prefer.

library(XML)
library(httr)

# Go to api and get Count
dbsnp.searchterm <- "(human[Organism]) AND GLRA3[Gene Name]"
dbsnp.url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
dbsnp.get <- GET(dbsnp.url, query=list(db="SNP", term=dbsnp.searchterm))
dbsnp.content <- content(dbsnp.get, as="text")
dbsnp.xml <- xmlParse(dbsnp.content)

max_count <- xmlToList(dbsnp.xml)$Count

# Use the Count to form the query that you want
dbsnp.full.get <- GET(dbsnp.url, query=list(
    db="SNP", 
    term=dbsnp.searchterm, 
    RetMax=max_count))
dbsnp.full.content <- content(dbsnp.full.get, as="text")
dbsnp.full.xml <- xmlParse(dbsnp.full.content)
dbsnp.full.list <- xmlToList(dbsnp.full.xml)

prefix <- "/projects/SNP/snp_ref.cgi?rs="

dbsnp.links <- paste0(prefix, unlist(dbsnp.full.list$IdList))
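
For anyone who prefers XPath over xmlToList, here is a minimal sketch of the same extraction. It runs offline on a small made-up sample that only imitates the shape of an eSearch response (the sample.xml string and its three Ids are illustrative, not real output). The path /eSearchResult/Count is anchored at the root on the assumption that other Count elements may appear deeper in a real response.

```r
library(XML)

# Made-up miniature of an eSearch response; a real one holds many more Ids
sample.xml <- '<eSearchResult>
  <Count>4736</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>116474260</Id>
    <Id>121918408</Id>
    <Id>121918409</Id>
  </IdList>
</eSearchResult>'

doc <- xmlParse(sample.xml, asText = TRUE)

# Total number of hits, read from the top-level Count element only
total <- as.integer(xpathSApply(doc, "/eSearchResult/Count", xmlValue))

# Every Id under IdList, as a character vector
ids <- xpathSApply(doc, "/eSearchResult/IdList/Id", xmlValue)

# Rebuild the relative links in the same form as the scraped hrefs
links <- paste0("/projects/SNP/snp_ref.cgi?rs=", ids)
```

With a real response you would pass dbsnp.full.xml to xpathSApply instead of doc.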

Upvotes: 1

Joost

Reputation: 86

It took me the entire afternoon and I still only have half of the solution (this is my first time working with XML). Anyway, I figured out that you can use the following link to get the results in XML format:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=SNP&term=(human[Organism])+AND+GLRA3[Gene+Name]

Where db stands for the database you want to search in, and term is fairly self-explanatory.

At the top of the results you will then see:

<Count>4736</Count>
<RetMax>20</RetMax>

Below this, the ID list starts and shows 20 IDs, which are equivalent to the rs value in:

/projects/SNP/snp_ref.cgi?rs=116474260

You can use the GET function to get this information in R. The remaining step is to have R read the number in the Count row (the total number of results available) and then call GET again with &RetMax=X added to the end of the link, where X is the number from the Count row.

For example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=SNP&term=(human[Organism])+AND+GLRA3[Gene+Name]&RetMax=4736
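
As a rough sketch of building that second link in R without pasting strings by hand, httr's modify_url can attach the parameters and URL-encode them. Nothing is downloaded here; the count value is hard-coded from the example above, and the names base.url, count, and full.url are only illustrative.

```r
library(httr)

base.url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed value: in practice this is read from the Count row of the first response
count <- 4736L

# modify_url attaches the query list to the URL and encodes spaces and brackets
full.url <- modify_url(base.url, query = list(
    db = "SNP",
    term = "(human[Organism]) AND GLRA3[Gene Name]",
    RetMax = as.character(count)))

print(full.url)
```

Passing full.url to GET should then request all the IDs in one response, assuming the service accepts a RetMax that large.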

Now all the IDs are imported into R (again, I lack the skills to extract them nicely from the data, so that part is for someone else to figure out).

Hope this helps!

Upvotes: 1
