Reputation: 13
I have searched a lot but unable to find a good solution to my problem.
I am trying to automate some of my work and scrape some data from a site that my company uses. (FYI - TOS do not seem to indicate they don't want to be scraped in case anyone is wondering. So I should be good.)
So far I have the following code
library(devtools)
library(RSelenium)
library(XML)
library(rvest)
library(stringr)
library(dplyr)
library(knitr)
library(magrittr)
library(rjson)
library(stringi)
#login
appURL <- 'URL I Am accessing/'
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$maxWindowSize()
remDr$navigate(appURL)
UN <- remDr$findElement(using = 'xpath', "//*[@id='login-form']/div[2]/div[2]/input")
UN$sendKeysToElement(list("Username"))
PW <- remDr$findElement(using = 'xpath', "//*[@id='login-form']/div[2]/div[3]/input")
PW$sendKeysToElement(list("password", key = "enter"))
URL <- 'URL of page with data'
remDr$navigate(URL)
Sys.sleep(2)
Source <- remDr$getPageSource()[[1]]
Text <- read_xml(Source,encoding = "", as_html = F, options = "NOBLANKS") %>%
xml_text(trim = T)
Text <- unlist(Text)
At this point I have a lot of text that contains JSON but the JSON is organized like this:
event: optionCollection
id: 229
data: [{JSON}]
:
event: pageDescription
id: 230
data: [{JSON}]
:
event: dataTable.headerRows
id: 232
data: [{JSON}]
:
event: dataTable.dataRows
id: 233
data: [{JSON}]
Where the JSON I need is the JSON related to the event: dataTable.headerRows and dataTable.dataRows.
There are usually multiple dataRows events I need to extract data from.
Can anyone make a suggestion on how to get these into R?
Open to any suggestions or if you need more clarification please let me know.
Thanks!
*Edit - Added current library's per request.
*Edit - @Parfait this is what is returned:
$event
[1] "report.finished"
$id
[1] "2277"
$data
[1] "{\"status\":1}"
$event
[1] "report.finished"
$id
[1] "2277"
$data
[1] "{\"status\":1}"
This only appears in the tempfile.txt once so I am not clear what is causing the issue because it seems the code should work.
Here is the written tmpfile with the data substituted for the samples you provided in your post:
https://1drv.ms/t/s!AlEviX19YBNogaZGAHCUCC_ZDEI5OA
Upvotes: 1
Views: 129
Reputation: 107567
Consider reading the string line by line with readLines()
and building a list of event, id, and data items, where data will be json strings. But first dump your string to file with writeLines()
. The colon-only line is used as the separator between list elements and must appear as you have it:
writeLines(Text, "tempfile.txt") # CREATE TEMP FILE
con <- file("tempfile.txt", open="r") # OPEN CONNECTION
datalist <- c()
while (length(line <- readLines(con, n=1, warn = FALSE)) > 0) {
if (grepl("event:", line)==TRUE){
eventitem <- gsub("event: ", "", line) # EVENT LINE
}
else if (grepl("id:", line)==TRUE){
iditem <- gsub("id: ", "", line) # ID LINE
}
else if (grepl("data:", line)==TRUE){
dataitem <- gsub("data: ", "", line) # DATA LINE
}
else if (grepl("^:", line)==TRUE) {
# COLON ONLY-LINE (APPENDING NESTED LIST ITEMS)
datalist <- c(datalist, list(event=eventitem, id=iditem, data=dataitem))
}
else {
dataitem <- paste0(dataitem, gsub("data: ", "", line)) # ADD DATA LINES
}
}
# REMAINING LAST LIST ITEMS
datalist <- c(datalist, list(event=eventitem, id=iditem, data=dataitem))
close(con) # CLOSE CONNECTION
unlink("tempfile.txt") # DELETE TEMP FILE
Output (using an example repeat JSON)
$event
[1] "optionCollection"
$id
[1] "229"
$data
[1] "[{\"sales\": {\"sales_val\": 22549,\"units_in_stock\":
251,\"product_id\": \"0141602\"},\"sales\": {\"sales_val\":
22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"},\"sales\":
{\"sales_val\": 22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"}}]"
$event
[1] "pageDescription"
$id
[1] "230"
$data
[1] "[{\"sales\": {\"sales_val\": 22549,\"units_in_stock\":
251,\"product_id\": \"0141602\"},\"sales\": {\"sales_val\":
22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"},\"sales\":
{\"sales_val\": 22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"}}]"
$event
[1] "dataTable.headerRows"
$id
[1] "232"
$data
[1] "[{\"sales\": {\"sales_val\": 22549,\"units_in_stock\":
251,\"product_id\": \"0141602\"},\"sales\": {\"sales_val\":
22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"},\"sales\":
{\"sales_val\": 22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"}}]"
$event
[1] "dataTable.dataRows"
$id
[1] "233"
$data
[1] "[{\"sales\": {\"sales_val\": 22549,\"units_in_stock\":
251,\"product_id\": \"0141602\"},\"sales\": {\"sales_val\":
22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"},\"sales\":
{\"sales_val\": 22549,\"units_in_stock\": 251,\"product_id\": \"0141602\"}}]"
Upvotes: 1