Dee

Reputation: 3

Converting XML to dataframe in R studio

I am not a coder, but I am trying to learn R. I have scraped these jobs from Indeed and need them in a data frame for analysis. My file is here

I used this code:

install.packages("XML")
library("XML")
library("methods")
results <- xmlParse("http://api.indeed.com/ads/apisearch?publisher=8693092939388569&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2", isURL=TRUE)
print(results)
rootnode <- xmlRoot(results)
rootsize <- xmlSize(rootnode)
> print(rootsize)
[1] 10

My problem starts with the following code (I think the argument I am passing is wrong):

xmldataframe <- xmlToDataFrame("http://api.indeed.com/ads/apisearch?publisher=8693092939388569&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2")
print(xmldataframe)

Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Indian Council Of Medical Research (ICMR) Needs ScientistIndian Council of Medical Research (ICMR)INIndiaEmployment SamacharThu, 18 Aug 2016 16:16:15 GMTIndian Council Of Medical Research (ICMR) Needs Scientist. Indian Council of Medical Research (ICMR) invites applications to recruit on vacant posts of...http://www.indeed.co.in/viewjob?jk=20d1db3c7d973199&qd=704PFtVAS6xUi0-OukCaEmfxgGzxqabhMKv0iphFlwZvghJwQWAysomG7BsaL67IpeRHLNudzQ_v_UGEGMFYq0JvivwR6g0dNKs-MyZMxww&indpubnum=8693092939388569&atk=1arpjr78d5upddvtindeed_clk(this,'6618');20d1db3c7d973199falsefalsefalseIndia16 days ago",  : 
  duplicate subscripts for columns
> print(xmldataframe)
Error in print(xmldataframe) : object 'xmldataframe' not found

What am I doing wrong?

Upvotes: 0

Views: 930

Answers (2)

Rentrop

Reputation: 21497

You can use xml2 and purrr to do that as follows:

library(xml2)
library(purrr)

# Read the XML response from the API
doc <- read_xml("http://api.indeed.com/ads/apisearch?publisher=8693092939388569&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2")

# One row per <result>; each child element becomes a column
doc %>% 
  xml_find_all("//results/result") %>% 
  map(xml_children) %>% 
  map_df(~map(setNames(xml_text(.), xml_name(.)), type.convert, as.is=TRUE))
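
If the API URL is not reachable, the same one-row-per-<result> idea can be tried offline. This is only a sketch: the inline XML is a made-up stand-in for the API response (the element names jobtitle, company and city are assumptions), and it uses xml2 with base R instead of purrr, so it assumes every <result> has the same child elements.

```r
library(xml2)

# Hypothetical sample mimicking the shape of the API response
sample_xml <- '<response>
  <query>data scientist</query>
  <results>
    <result><jobtitle>Data Scientist</jobtitle><company>Acme</company><city>Pune</city></result>
    <result><jobtitle>Senior Data Scientist</jobtitle><company>Globex</company><city>Mumbai</city></result>
  </results>
</response>'

doc <- read_xml(sample_xml)

# Select only the repeated <result> elements
results <- xml_find_all(doc, "//results/result")

# Turn each <result> into a one-row data frame named by its child elements
rows <- lapply(results, function(node) {
  fields <- xml_children(node)
  as.data.frame(as.list(setNames(xml_text(fields), xml_name(fields))),
                stringsAsFactors = FALSE)
})

# Stack the rows (requires identical columns in every row)
jobs <- do.call(rbind, rows)
print(jobs)
```

If some results are missing fields, rbind() will fail; in that case the map_df() version above is the more forgiving choice because it fills gaps with NA.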

Upvotes: 1

Parfait

Reputation: 107567

To use xmlToDataFrame(), you first need to parse the XML document and then pass the repeated elements of the parsed document (here, the <result> nodes) that will serve as your rows. Fortunately, the XML is not heavily nested, so no extra data transformation or wrangling is needed.

library("XML")

# PARSE DOCUMENT FROM URL (paste0 used to break up line for readability)
results <- xmlParse(paste0("http://api.indeed.com/ads/apisearch?publisher=8693092939388569",
                           "&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000",
                           "&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4",
                           "&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"), isURL=TRUE)

# CONVERT TO DATA FRAME ON <result> NODE
df <- xmlToDataFrame(nodes = getNodeSet(results, "//results/result"))

Screenshot output (25 obs. of 19 variables): [image: XML dataframe result]
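
For a quick offline check of the nodes approach, here is a minimal sketch on a small inline document. The XML below is a hypothetical stand-in for the API response, and its field names (jobtitle, company, city) are assumptions for illustration only.

```r
library(XML)

# Hypothetical sample: top-level fields plus repeated <result> elements
sample_xml <- '<response>
  <query>data scientist</query>
  <totalresults>2</totalresults>
  <results>
    <result><jobtitle>Data Scientist</jobtitle><company>Acme</company><city>Pune</city></result>
    <result><jobtitle>Senior Data Scientist</jobtitle><company>Globex</company><city>Mumbai</city></result>
  </results>
</response>'

# Parse from a string instead of a URL
doc <- xmlParse(sample_xml, asText = TRUE)

# Pick out the repeated <result> nodes; each one becomes a row
rows <- getNodeSet(doc, "//results/result")
jobs <- xmlToDataFrame(nodes = rows, stringsAsFactors = FALSE)

print(jobs)
```

Note that the root node's children here are query, totalresults and results, not the individual jobs, which is why the size of the root node does not tell you how many job records there are.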

Upvotes: 2
