Meenakshi Vikram

Reputation: 19

Unable to web scrape contents using rvest

I want to scrape all the codes, including the codes under each level of the hierarchy shown in the left panel of this website, using the R package rvest.

URL: http://apps.who.int/classifications/icd10/browse/2016/en/

To begin with, I tried this code:

library(rvest)

url <- "http://apps.who.int/classifications/icd10/browse/2016/en/"
src <- read_html(url)
# Select the first tree label in the left panel and extract its text
ATC <- src %>% html_node("a.ygtvlabel") %>% html_text()

a.ygtvlabel is the element and class I see when hovering over the text in the web page.

However, it just returns NA_character_. I see that the HTML source of the page does not directly contain these codes (e.g. "parasitic diseases"); instead, it probably has an href pointing to the contents.

How can I go about scraping such a page? Kindly advise.

Upvotes: 0

Views: 571

Answers (1)

Andrew Lavers

Reputation: 4378

As with many pages of this kind, this page makes a background request for a JSON file that contains the data. You can discover this by opening the browser's debug tools and looking at the network requests. Using an API, as noted in the comments, is a better choice.

library(httr)
library(jsonlite)

## original url<-"http://apps.who.int/classifications/icd10/browse/2016/en/"

json_url <- "http://apps.who.int/classifications/icd10/browse/2016/en/JsonGetRootConcepts?useHtml=false"
json_data <- rawToChar(GET(json_url)$content)

categories <- fromJSON(json_data)
categories$label
# [1] "I Certain infectious and parasitic diseases"                                                            
# [2] "II Neoplasms"                                                                                           
# [3] "III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism"
# [4] "IV Endocrine, nutritional and metabolic diseases"                                                       
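
Since the question asks for the codes themselves, here is a minimal sketch of splitting each label into a chapter numeral and a title. It runs offline on a hypothetical `sample_json` string that I wrote to mimic the output above; only the `label` field is taken from that output, and the real response may well contain other fields. With live data you would skip `sample_json` and use `categories` from the code above.

```r
library(jsonlite)

# Hypothetical stand-in for the JSON returned by the endpoint above.
# Only the `label` field is known from the output shown; this is an assumption.
sample_json <- '[
  {"label": "I Certain infectious and parasitic diseases"},
  {"label": "II Neoplasms"},
  {"label": "IV Endocrine, nutritional and metabolic diseases"}
]'

# A JSON array of objects is simplified to a data frame by default
categories <- fromJSON(sample_json)

# Each label has the form "<chapter numeral> <title>", so split at the first space
chapters <- data.frame(
  chapter = sub(" .*$", "", categories$label),
  title   = sub("^[^ ]+ ", "", categories$label),
  stringsAsFactors = FALSE
)

print(chapters)
```

This only covers the top level. The codes under each chapter would need further requests, which you can find the same way, by watching the network tab while expanding a node in the left panel.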

Upvotes: 1
