Reputation: 3
I am not a coder, but I am trying to learn R. I have scraped these jobs from Indeed and need them in a data frame for analysis. My file is here
I used this code:
install.packages("XML")
library("XML")
library("methods")
results <- xmlParse("http://api.indeed.com/ads/apisearch?publisher=8693092939388569&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2", isURL=TRUE)
print(results)
rootnode <- xmlRoot(results)
rootsize <- xmlSize(rootnode)
print(rootsize)
> print(rootsize)
[1] 10
My problem starts with the following code (I think the argument is not being handled correctly):
xmldataframe <- xmlToDataFrame("http://api.indeed.com/ads/apisearch?publisher=8693092939388569&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2")
print(xmldataframe)
Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Indian Council Of Medical Research (ICMR) Needs ScientistIndian Council of Medical Research (ICMR)INIndiaEmployment SamacharThu, 18 Aug 2016 16:16:15 GMTIndian Council Of Medical Research (ICMR) Needs Scientist. Indian Council of Medical Research (ICMR) invites applications to recruit on vacant posts of...http://www.indeed.co.in/viewjob?jk=20d1db3c7d973199&qd=704PFtVAS6xUi0-OukCaEmfxgGzxqabhMKv0iphFlwZvghJwQWAysomG7BsaL67IpeRHLNudzQ_v_UGEGMFYq0JvivwR6g0dNKs-MyZMxww&indpubnum=8693092939388569&atk=1arpjr78d5upddvtindeed_clk(this,'6618');20d1db3c7d973199falsefalsefalseIndia16 days ago", :
duplicate subscripts for columns
> print(xmldataframe)
Error in print(xmldataframe) : object 'xmldataframe' not found
What am I doing wrong?
Upvotes: 0
Views: 930
Reputation: 21497
You can use xml2 and purrr to do that as follows:
library(xml2)
library(purrr)
# Read the XML response from the API
doc <- read_xml("http://api.indeed.com/ads/apisearch?publisher=8693092939388569&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2")
# For each <result> node, name its child values by tag, convert types, and bind the rows
doc %>%
  xml_find_all("//results/result") %>%
  map(xml_children) %>%
  map_df(~map(setNames(xml_text(.), xml_name(.)), type.convert, as.is=TRUE))
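To try the same idea without calling the API, here is a minimal sketch on a small inline document. The XML content is made up for illustration and only mimics the `results`/`result` layout of the response. It swaps purrr for base R (`lapply` and `rbind`), and it assumes every `<result>` has the same child elements.

```r
library(xml2)

# Hypothetical miniature of the API response (not real data)
doc <- read_xml(paste0(
  "<response><query>data scientist</query><results>",
  "<result><jobtitle>Scientist</jobtitle><company>ICMR</company></result>",
  "<result><jobtitle>Analyst</jobtitle><company>Acme</company></result>",
  "</results></response>"
))

# One node per job posting
results <- xml_find_all(doc, "//results/result")

# Turn each <result> into a named list: element name -> element text
rows <- lapply(results, function(node) {
  fields <- xml_children(node)
  as.list(setNames(xml_text(fields), xml_name(fields)))
})

# Bind the rows; this assumes all rows share the same field names
jobs <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
print(jobs)
```

If some postings are missing fields, the `rbind` step would fail, which is where a row-binding helper such as `map_df` is more forgiving.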
Upvotes: 1
Reputation: 107567
To use xmlToDataFrame(), you first need to parse the XML document and then reference both the parsed document and the repeated elements that will serve as your rows. Fortunately, the XML is not nested deeply enough to require additional data transformation or wrangling.
library("XML")
# PARSE DOCUMENT FROM URL (paste0 used to break up line for readability)
results <- xmlParse(paste0("http://api.indeed.com/ads/apisearch?publisher=8693092939388569",
                           "&q=data+scientist&sort=&radius=&st=&jt=&start=&limit=2000",
                           "&fromage=&filter=&latlong=1&co=in&chnl=&userip=1.2.3.4",
                           "&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"), isURL=TRUE)
# CONVERT TO DATA FRAME ON <result> NODE
df <- xmlToDataFrame(nodes = getNodeSet(results, "//results/result"))
The result is a data frame with 25 observations of 19 variables.
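As a rough illustration of why the `nodes` argument matters, the sketch below uses a small made-up document shaped like the API response (the element values are hypothetical). Without `nodes`, xmlToDataFrame() works from the children of the root, and one of those children, `<results>`, contains several children all named `result`, which is the likely source of the "duplicate subscripts for columns" error in the question.

```r
library(XML)

# Hypothetical miniature of the API response (not real data)
xml_string <- paste0(
  "<response><query>data scientist</query><results>",
  "<result><jobtitle>Scientist</jobtitle><company>ICMR</company></result>",
  "<result><jobtitle>Analyst</jobtitle><company>Acme</company></result>",
  "</results></response>"
)
doc <- xmlParse(xml_string, asText = TRUE)

# The root's direct children are metadata plus the <results> wrapper,
# not the individual job postings
top_level <- unname(xmlSApply(xmlRoot(doc), xmlName))
print(top_level)

# Pointing `nodes` at the repeated <result> elements gives one row per posting
jobs <- xmlToDataFrame(nodes = getNodeSet(doc, "//results/result"),
                       stringsAsFactors = FALSE)
print(jobs)
```

The same XPath, `//results/result`, is what selects the rows in the code above.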
Upvotes: 2