ikop

Reputation: 1790

PhantomJS not evaluating JavaScript-generated tables

I am trying to scrape the data from the page http://empres-i.fao.org/empres-i/2/obd?idOutbreak=225334&rss=t. The data is contained in several tables that seem to be generated dynamically using JavaScript. The HTML source only shows the containers (ids container1 and container2), not the data itself. I tried PhantomJS (version 2.1.1) on a Windows 10 system with the following code:

var url = 'http://empres-i.fao.org/empres-i/2/obd?idOutbreak=225334&rss=t';
var page = require('webpage').create();
page.open(url, function () {
    console.log(page.content);
    phantom.exit();
});

My plan is to scrape the evaluated HTML using PhantomJS and then extract the data I need using R. I know R is probably not the best tool for this, but it is what I am most familiar with and what we use at my company.

However, with the code above I still get only the unevaluated source with the empty containers, not the data (which I do get when, for example, I save the page manually in Firefox). Why is PhantomJS not evaluating the JavaScript? What can I do to access the data?

I have little experience with web scraping and would really appreciate it if someone could point me in the right direction. And as Denzel Washington likes to say in Philadelphia, "please explain it to me as if I were a six year old". Thanks!

Upvotes: 0

Views: 99

Answers (1)

hrbrmstr

Reputation: 78832

No need for phantomjs (et al). The page fills those containers from a JSON endpoint that it calls via XHR, so just query that hidden XHR API directly:

library(jsonlite)

str(fromJSON("http://empres-i.fao.org/empres-i/obdj?id=225334&lang=EN"))
## List of 31
##  $ outbreak              :List of 23
##   ..$ id                      : int 225334
##   ..$ reportingDate           : chr "Mar 23, 2017"
##   ..$ markerIcon              : chr "domestic_red.png"
##   ..$ localityName            : chr "Cullman"
##   ..$ localityQuality         : chr "Centroid Admin2"
##   ..$ region                  : chr "Americas"
##   ..$ country                 : chr "United States of America"
##   ..$ admin1                  : chr "Alabama"
##   ..$ latitude                : num 34.1
##   ..$ longitude               : num -86.9
##   ..$ status                  : chr "Confirmed"
##   ..$ disease                 : chr "Influenza - Avian"
##   ..$ serotypes               : chr "H7N9 LPAI"
##   ..$ source                  : chr "National authorities"
##   ..$ speciesDescription      : chr "domestic, unspecified bird"
##   ..$ hasHumansAffected       : logi FALSE
##   ..$ humansAge               : int 0
##   ..$ speciesAffectedList     :'data.frame': 1 obs. of  5 variables:
##   .. ..$ id         : int 109831
##   .. ..$ idOutbreak : int 225334
##   .. ..$ animalType : chr "Domestic"
##   .. ..$ animalClass: chr "Birds"
##   .. ..$ species    : chr "Unspecified bird"
##   ..$ laboratoryTestList      :'data.frame': 1 obs. of  6 variables:
##   .. ..$ id                 : int 74563
##   .. ..$ idOutbreak         : int 225334
##   .. ..$ formattedResultDate: chr "22/03/2017"
##   .. ..$ diseaseTested      : chr "Influenza - Avian"
##   .. ..$ speciesTested      : chr "Unspecified bird"
##   .. ..$ result             : chr "Positive"
##   ..$ sibMatchedIsolateList   : list()
##   ..$ formattedObservationDate: chr "23/03/2017"
##   ..$ formattedReportingDate  : chr "23/03/2017"
##   ..$ idWorkspace             : chr "empresi"
##  $ strGeneralInfo        : chr "GENERAL INFO"
##  $ strDiseaseEventID     : chr "Disease Event ID"
##  $ strReportingDate      : chr "Reporting date"
##  $ strObservationDate    : chr "Observation date"
##  $ strLocation           : chr "LOCATION"
##  $ strRegion             : chr "Region"
##  $ strAdmin1             : chr "Admin 1 (Country)"
##  $ strLocality           : chr "Locality"
##  $ strLatLong            : chr "Lat/Long"
##  $ strCoordsQuality      : chr "Quality of Coordinates"
##  $ strDisease            : chr "DISEASE"
##  $ strStatus             : chr "Status"
##  $ strSerotypes          : chr "Serotype"
##  $ strSource             : chr "Source"
##  $ strSpeciesAffected    : chr "SPECIES AFFECTED"
##  $ strAnType             : chr "An.Type"
##  $ strAnClass            : chr "An.Class"
##  $ strSpecies            : chr "Species"
##  $ strAtRisk             : chr "At Risk"
##  $ strCases              : chr "Cases"
##  $ strDeaths             : chr "Deaths"
##  $ strDestroyed          : chr "Destroyed"
##  $ strSlaughtered        : chr "Slaughtered"
##  $ strTest               : chr "Test"
##  $ strResult             : chr "Result"
##  $ strResultDate         : chr "Result Date"
##  $ strDiseaseTested      : chr "Disease Tested"
##  $ strReferenceLaboratory: chr "Reference Laboratory"
##  $ strLaboratory         : chr "LABORATORIES"
##  $ strPageTitle          : chr "Disease Event Details"
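
If you have more than one outbreak page to process, the mapping from page URL to JSON URL can be sketched as below. This is only a minimal sketch: `outbreakJsonUrl` is a hypothetical helper name, and it assumes the `obdj?id=<id>&lang=EN` pattern shown above holds for other outbreak ids, which I have not verified. It is plain JavaScript (Node, not PhantomJS); the same string manipulation is easy to redo in R with `paste0()`.

```javascript
// Hypothetical helper: derive the JSON endpoint URL from an outbreak page URL.
// Assumption: the endpoint pattern from the answer (obdj?id=<id>&lang=<lang>)
// is the same for every outbreak id.
function outbreakJsonUrl(pageUrl, lang = 'EN') {
  const parsed = new URL(pageUrl);
  // The page URL carries the outbreak id in the idOutbreak query parameter.
  const id = parsed.searchParams.get('idOutbreak');
  if (!id || !/^\d+$/.test(id)) {
    throw new Error('No numeric idOutbreak parameter in: ' + pageUrl);
  }
  return parsed.origin + '/empres-i/obdj?id=' + id +
    '&lang=' + encodeURIComponent(lang);
}

console.log(outbreakJsonUrl(
  'http://empres-i.fao.org/empres-i/2/obd?idOutbreak=225334&rss=t'
));
// prints http://empres-i.fao.org/empres-i/obdj?id=225334&lang=EN
```

The resulting URL is what you would pass to `fromJSON()` as in the call above.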

Upvotes: 2
