Reputation: 57
I am using the following code to get information from a website (http://q.stock.sohu.com/cn/000002/lshq.shtml), but I do not know how to turn the result into a data frame with the columns "date, open, close, high, low". Any help would be appreciated.
thepage = readLines('http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654')
How can I get the data frame?
Upvotes: 0
Views: 265
Reputation: 78842
I don't know which parts of the returned JSON are the actual values you need, but I assume they are components of the hq record. This should work:
library(RJSONIO)
library(RCurl)
# get the raw data
dat.json.raw <- getURL("http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654")
tt <- textConnection(dat.json.raw)
dat.json <- readLines(tt)
close(tt)
# remove callback
dat.json <- gsub("^historySearchHandler\\(", "", dat.json)
dat.json <- gsub("\\)$", "", dat.json)
# convert to R structure
dat.l <- fromJSON(dat.json)
# get the meaty part of the data into a data.frame
dat <- data.frame(t(sapply(dat.l[[1]]$hq, unlist)), stringsAsFactors=FALSE)
dat$X1 <- as.Date(dat$X1)
dat$X2 <- as.numeric(dat$X2)
dat$X3 <- as.numeric(dat$X3)
dat$X4 <- as.numeric(dat$X4)
str(dat)
## 'data.frame': 79 obs. of 10 variables:
## $ X1 : Date, format: "2014-03-18" "2014-03-17" "2014-03-14" ...
## $ X2 : num 7.76 7.6 7.68 7.58 7.48 7.19 7.22 7.34 6.76 6.92 ...
## $ X3 : num 7.6 7.76 7.53 7.71 7.6 7.5 7.15 7.27 7.32 6.76 ...
## $ X4 : num -0.16 0.23 -0.18 0.11 0.1 0.35 -0.12 -0.05 0.56 -0.16 ...
## $ X5 : chr "-2.06%" "3.05%" "-2.33%" "1.45%" ...
## $ X6 : chr "7.55" "7.59" "7.50" "7.53" ...
## $ X7 : chr "7.76" "7.80" "7.81" "7.85" ...
## $ X8 : chr "843900" "1177079" "1303110" "1492359" ...
## $ X9 : chr "64268.06" "90829.30" "99621.34" "114990.40" ...
## $ X10: chr "0.87%" "1.22%" "1.35%" "1.54%" ...
head(dat)
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 2014-03-18 7.76 7.60 -0.16 -2.06% 7.55 7.76 843900 64268.06 0.87%
## 2 2014-03-17 7.60 7.76 0.23 3.05% 7.59 7.80 1177079 90829.30 1.22%
## 3 2014-03-14 7.68 7.53 -0.18 -2.33% 7.50 7.81 1303110 99621.34 1.35%
## 4 2014-03-13 7.58 7.71 0.11 1.45% 7.53 7.85 1492359 114990.40 1.54%
## 5 2014-03-12 7.48 7.60 0.10 1.33% 7.42 7.85 2089873 160315.88 2.16%
## 6 2014-03-11 7.19 7.50 0.35 4.90% 7.15 7.59 1892488 141250.94 1.96%
You'll need to fix some of the other columns (since I don't know exactly what you need).
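As a minimal sketch of fixing the remaining columns, the snippet below rebuilds the first three rows of the head(dat) output above by hand so that it runs without network access. The column names are an assumption inferred from the values (for example, X7 is never below X6, so they are taken to be high and low); the site itself is not consulted here, so check them against the page before relying on them.

```r
# Sample rows copied from the head(dat) output above; every column starts as text.
dat <- data.frame(
  X1  = c("2014-03-18", "2014-03-17", "2014-03-14"),
  X2  = c("7.76", "7.60", "7.68"),
  X3  = c("7.60", "7.76", "7.53"),
  X4  = c("-0.16", "0.23", "-0.18"),
  X5  = c("-2.06%", "3.05%", "-2.33%"),
  X6  = c("7.55", "7.59", "7.50"),
  X7  = c("7.76", "7.80", "7.81"),
  X8  = c("843900", "1177079", "1303110"),
  X9  = c("64268.06", "90829.30", "99621.34"),
  X10 = c("0.87%", "1.22%", "1.35%"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: these names are guessed from the values, not confirmed by the site.
names(dat) <- c("date", "open", "close", "change", "pct_change",
                "low", "high", "volume", "amount", "turnover")

dat$date <- as.Date(dat$date)

# Strip the trailing "%" from the percentage columns before converting.
pct_cols <- c("pct_change", "turnover")
dat[pct_cols] <- lapply(dat[pct_cols], function(x) as.numeric(sub("%$", "", x)))

# Everything else except the date is a plain number stored as text.
num_cols <- setdiff(names(dat), c("date", pct_cols))
dat[num_cols] <- lapply(dat[num_cols], as.numeric)

# Keep only the columns asked for in the question.
ohlc <- dat[, c("date", "open", "close", "high", "low")]
str(ohlc)
```

With the real data, the same steps apply to the full dat from above; only the hand-built sample at the top goes away.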
For folks who don't like the warnings that come back from the fromJSON call, you can just wrap the readLines with a paste: dat.json <- paste(readLines(tt), collapse=""). It's not necessary (the warnings are harmless) so I don't usually bother with the extra step.
Upvotes: 1
Reputation: 13100
It seems like you're trying to scrape a website that presents its data as JSON.
For that, in addition to the "usual steps" for scraping a website, you'll also need to parse and manipulate JSON data.
First, the usual steps. If you have an HTML page with an easy-to-grab table, this should work:
require("XML")
x <- readHTMLTable(
  doc="http://www.someurl.com"
)
Otherwise you'll definitely need to use some XPath to get to the nodes that you're interested in.
This usually involves parsing the HTML code via htmlTreeParse() and getting to the respective nodes via getNodeSet() and the like:
x <- htmlTreeParse(
  file="http://www.someurl.com",
  isURL=TRUE,
  useInternalNodes=TRUE
)
# replace the placeholder string with your own XPath expression
res <- getNodeSet(x, "<your-xpath-statement>")
Now for your specific URL. Parse the HTML code:
x <- htmlTreeParse(
file="http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654",
isURL=TRUE,
useInternalNodes=TRUE
)
Retrieve the actual JSON data:
json <- getNodeSet(x, "//body/p")
json <- xmlValue(json[[1]])
Get rid of non-JSON components:
json <- gsub("historySearchHandler\\(", "", json, perl=TRUE)
json <- gsub("\\)$", "", json, perl=TRUE)
Parse the JSON data:
require("jsonlite")
fromJSON(json, simplifyVector=FALSE)
[[1]]
[[1]]$status
[1] 0
[[1]]$hq
[[1]]$hq[[1]]
[[1]]$hq[[1]][[1]]
[1] "2014-03-18"
[[1]]$hq[[1]][[2]]
[1] "7.76"
[...]
Now you need to bring that into a more data.frame-like order (methods that come to mind are do.call(), rbind(), cbind()).
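One way to do that reshaping, as a rough sketch in base R: the hq list below is a hand-written stand-in with the same nesting as the parsed output above (with the real data it would be the $hq element of the fromJSON() result), cut down to three fields per record. The column names are hypothetical.

```r
# Hypothetical stand-in for the parsed result: a list of records,
# each record being a list of strings.
hq <- list(
  list("2014-03-18", "7.76", "7.60"),
  list("2014-03-17", "7.60", "7.76"),
  list("2014-03-14", "7.68", "7.53")
)

# Flatten each record to a character vector, then stack the records as rows.
rows <- lapply(hq, unlist)
mat <- do.call(rbind, rows)

# Turn the character matrix into a data frame and fix the column types.
df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- c("date", "open", "close")  # assumed names, for illustration only
df$date <- as.Date(df$date)
df[c("open", "close")] <- lapply(df[c("open", "close")], as.numeric)
str(df)
```

do.call(rbind, ...) is used here because it stacks any number of equal-length records in one call, which avoids growing the data frame row by row in a loop.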
Sooner or later (rather sooner than later, as we see in this very example), you'll be confronted with encoding issues (stuff like "ÀÛ¼Æ:").
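A small sketch of how this kind of garbage typically arises and how declaring the right source encoding undoes it, using base R's iconv(). It assumes the page is served as GBK, which is only a guess (the encoding is not stated anywhere in the response), and it assumes your platform's iconv knows the name "GBK". The sample text is an arbitrary two-character Chinese string written with Unicode escapes, not data taken from the site.

```r
# Arbitrary sample text, written with escapes so the script itself stays ASCII.
original <- "\u7d2f\u8ba1:"

# Simulate what the server sends: the same text as GBK-encoded bytes.
# ASSUMPTION: the page uses GBK; this is not confirmed by the source.
gbk_bytes <- iconv(original, from = "UTF-8", to = "GBK")

# Reading those bytes as if they were Latin-1 produces accented-letter garbage.
garbled <- iconv(gbk_bytes, from = "latin1", to = "UTF-8")

# Declaring the correct source encoding recovers the text.
fixed <- iconv(gbk_bytes, from = "GBK", to = "UTF-8")

print(garbled)
print(fixed == original)
```

If iconv() returns NA, the encoding name is not supported on your system; iconvlist() shows the names that are.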
You can play around with different encodings either directly when parsing the HTML code (argument encoding in htmlTreeParse()) or modify a character string's encoding via Encoding() afterwards. I wasn't able to get it all correct for your values, though. Encoding issues can be quite a pain.
I'd recommend choosing English-based examples (an English-language website in this case) in the future; otherwise you're severely limiting the number of people who might be able to help you.
Upvotes: 1