Reputation: 1790
I am trying to scrape the data from the page http://empres-i.fao.org/empres-i/2/obd?idOutbreak=225334&rss=t. The data is contained in several tables that seem to be generated dynamically using JavaScript. The HTML source only shows the containers (ids container1 and container2) but not the data itself. I tried PhantomJS (version 2.1.1) on a Windows 10 system with the following code:
var url = 'http://empres-i.fao.org/empres-i/2/obd?idOutbreak=225334&rss=t';
var page = require('webpage').create();
page.open(url, function () {
    console.log(page.content);
    phantom.exit();
});
My plan is to get the evaluated HTML with PhantomJS and then extract the data I need using R. I know R is probably not the best tool for this, but it is what I am most familiar with and what we use at my company.
With the code above, however, I still get only the unevaluated source with the empty containers, not the data (which I do get when I save the page manually in Firefox). Why is PhantomJS not evaluating the JavaScript? What can I do to access the data?
I have little experience with web scraping and would really appreciate it if someone could point me in the right direction. And as Denzel Washington likes to say in Philadelphia, "please explain it to me as if I were a six-year-old". Thanks!
Upvotes: 0
Views: 99
Reputation: 78832
No need for PhantomJS (et al.). The page loads its data from a JSON endpoint via XHR, so just call that hidden API directly:
library(jsonlite)
str(fromJSON("http://empres-i.fao.org/empres-i/obdj?id=225334&lang=EN"))
## List of 31
## $ outbreak :List of 23
## ..$ id : int 225334
## ..$ reportingDate : chr "Mar 23, 2017"
## ..$ markerIcon : chr "domestic_red.png"
## ..$ localityName : chr "Cullman"
## ..$ localityQuality : chr "Centroid Admin2"
## ..$ region : chr "Americas"
## ..$ country : chr "United States of America"
## ..$ admin1 : chr "Alabama"
## ..$ latitude : num 34.1
## ..$ longitude : num -86.9
## ..$ status : chr "Confirmed"
## ..$ disease : chr "Influenza - Avian"
## ..$ serotypes : chr "H7N9 LPAI"
## ..$ source : chr "National authorities"
## ..$ speciesDescription : chr "domestic, unspecified bird"
## ..$ hasHumansAffected : logi FALSE
## ..$ humansAge : int 0
## ..$ speciesAffectedList :'data.frame': 1 obs. of 5 variables:
## .. ..$ id : int 109831
## .. ..$ idOutbreak : int 225334
## .. ..$ animalType : chr "Domestic"
## .. ..$ animalClass: chr "Birds"
## .. ..$ species : chr "Unspecified bird"
## ..$ laboratoryTestList :'data.frame': 1 obs. of 6 variables:
## .. ..$ id : int 74563
## .. ..$ idOutbreak : int 225334
## .. ..$ formattedResultDate: chr "22/03/2017"
## .. ..$ diseaseTested : chr "Influenza - Avian"
## .. ..$ speciesTested : chr "Unspecified bird"
## .. ..$ result : chr "Positive"
## ..$ sibMatchedIsolateList : list()
## ..$ formattedObservationDate: chr "23/03/2017"
## ..$ formattedReportingDate : chr "23/03/2017"
## ..$ idWorkspace : chr "empresi"
## $ strGeneralInfo : chr "GENERAL INFO"
## $ strDiseaseEventID : chr "Disease Event ID"
## $ strReportingDate : chr "Reporting date"
## $ strObservationDate : chr "Observation date"
## $ strLocation : chr "LOCATION"
## $ strRegion : chr "Region"
## $ strAdmin1 : chr "Admin 1 (Country)"
## $ strLocality : chr "Locality"
## $ strLatLong : chr "Lat/Long"
## $ strCoordsQuality : chr "Quality of Coordinates"
## $ strDisease : chr "DISEASE"
## $ strStatus : chr "Status"
## $ strSerotypes : chr "Serotype"
## $ strSource : chr "Source"
## $ strSpeciesAffected : chr "SPECIES AFFECTED"
## $ strAnType : chr "An.Type"
## $ strAnClass : chr "An.Class"
## $ strSpecies : chr "Species"
## $ strAtRisk : chr "At Risk"
## $ strCases : chr "Cases"
## $ strDeaths : chr "Deaths"
## $ strDestroyed : chr "Destroyed"
## $ strSlaughtered : chr "Slaughtered"
## $ strTest : chr "Test"
## $ strResult : chr "Result"
## $ strResultDate : chr "Result Date"
## $ strDiseaseTested : chr "Disease Tested"
## $ strReferenceLaboratory: chr "Reference Laboratory"
## $ strLaboratory : chr "LABORATORIES"
## $ strPageTitle : chr "Disease Event Details"
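On the original question of why PhantomJS printed empty containers: a likely explanation (I have not verified it against this site) is that the page.open callback fires once the initial document has loaded, which is before the page's XHR has returned and filled the tables. If you still want the rendered HTML, a minimal sketch is to wait a little before reading page.content. The helper name dumpAfterDelay and the 3000 ms delay are my own illustrative choices, not part of PhantomJS, and a fixed delay is only a rough stand-in for properly detecting that the data has arrived.

```javascript
// Sketch only: dumpAfterDelay and the 3000 ms delay are illustrative assumptions.
// Written in ES5 style because PhantomJS 2.1.1 does not support ES6 syntax.
function dumpAfterDelay(page, url, delayMs, done) {
    page.open(url, function (status) {
        if (status !== 'success') {
            done(new Error('Failed to load ' + url), null);
            return;
        }
        // Give the page's own XHR time to finish and fill the containers
        // before reading the evaluated HTML.
        setTimeout(function () {
            done(null, page.content);
        }, delayMs);
    });
}

// Only run the scrape when executed by PhantomJS, where the `phantom` global exists.
if (typeof phantom !== 'undefined') {
    var url = 'http://empres-i.fao.org/empres-i/2/obd?idOutbreak=225334&rss=t';
    var page = require('webpage').create();
    dumpAfterDelay(page, url, 3000, function (err, html) {
        console.log(err ? err.message : html);
        phantom.exit(err ? 1 : 0);
    });
}
```

If the delay turns out to be too short on a slow connection, the containers will still be empty, which is one more reason to prefer the JSON endpoint above.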
Upvotes: 2