user3394975

Reputation: 57

How to scrape data from a web site?

I am using the following code to get information from a web site (http://q.stock.sohu.com/cn/000002/lshq.shtml), but I do not know how to turn the result into a data frame with the columns date, open, close, high and low. Any help would be appreciated.

thepage = readLines('http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654')

How can I get the data frame?

Upvotes: 0

Views: 265

Answers (2)

hrbrmstr

Reputation: 78842

I don't know which parts of the returned JSON are the actual values you need, but I assume they are components of the hq record. This should work:

library(RJSONIO)
library(RCurl)

# get the raw data
dat.json.raw <- getURL("http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654")
tt <- textConnection(dat.json.raw)
dat.json <- readLines(tt)
close(tt)

# remove callback
dat.json <- gsub("^historySearchHandler\\(", "", dat.json)
dat.json <- gsub("\\)$", "", dat.json)

# convert to R structure
dat.l <- fromJSON(dat.json)

# get the meaty part of the data into a data.frame
dat <- data.frame(t(sapply(dat.l[[1]]$hq, unlist)), stringsAsFactors=FALSE)
dat$X1 <- as.Date(dat$X1)
dat$X2 <- as.numeric(dat$X2)
dat$X3 <- as.numeric(dat$X3)
dat$X4 <- as.numeric(dat$X4)

str(dat)
## 'data.frame':    79 obs. of  10 variables:
##  $ X1 : Date, format: "2014-03-18" "2014-03-17" "2014-03-14" ...
##  $ X2 : num  7.76 7.6 7.68 7.58 7.48 7.19 7.22 7.34 6.76 6.92 ...
##  $ X3 : num  7.6 7.76 7.53 7.71 7.6 7.5 7.15 7.27 7.32 6.76 ...
##  $ X4 : num  -0.16 0.23 -0.18 0.11 0.1 0.35 -0.12 -0.05 0.56 -0.16 ...
##  $ X5 : chr  "-2.06%" "3.05%" "-2.33%" "1.45%" ...
##  $ X6 : chr  "7.55" "7.59" "7.50" "7.53" ...
##  $ X7 : chr  "7.76" "7.80" "7.81" "7.85" ...
##  $ X8 : chr  "843900" "1177079" "1303110" "1492359" ...
##  $ X9 : chr  "64268.06" "90829.30" "99621.34" "114990.40" ...
##  $ X10: chr  "0.87%" "1.22%" "1.35%" "1.54%" ...

head(dat)
##           X1   X2   X3    X4     X5   X6   X7      X8        X9   X10
## 1 2014-03-18 7.76 7.60 -0.16 -2.06% 7.55 7.76  843900  64268.06 0.87%
## 2 2014-03-17 7.60 7.76  0.23  3.05% 7.59 7.80 1177079  90829.30 1.22%
## 3 2014-03-14 7.68 7.53 -0.18 -2.33% 7.50 7.81 1303110  99621.34 1.35%
## 4 2014-03-13 7.58 7.71  0.11  1.45% 7.53 7.85 1492359 114990.40 1.54%
## 5 2014-03-12 7.48 7.60  0.10  1.33% 7.42 7.85 2089873 160315.88 2.16%
## 6 2014-03-11 7.19 7.50  0.35  4.90% 7.15 7.59 1892488 141250.94 1.96%

You'll need to fix some of the other columns (since I don't know exactly what you need).
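If all you're after is date/open/close/high/low, here's a sketch for finishing the conversion. I'm guessing from the printed values that X6 is the day's low and X7 the high; verify that against the table on the site:

# guessing from the values above that X6 = low and X7 = high; verify on the site
dat$X6 <- as.numeric(dat$X6)
dat$X7 <- as.numeric(dat$X7)
names(dat)[c(1, 2, 3, 6, 7)] <- c("date", "open", "close", "low", "high")
head(dat[, c("date", "open", "close", "high", "low")])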

For folks who don't like the warnings that come back from the fromJSON call, you can just wrap the readLines call in a paste: dat.json <- paste(readLines(tt), collapse=""). It's not necessary (the warnings are harmless), so I don't usually bother with the extra step.

Upvotes: 1

Rappster

Reputation: 13100

Seems like you're trying to scrape a website that presents the data in JSON.

For that, in addition to the "usual steps" involved in scraping a website, you'll also need to deal with parsing and manipulating JSON data:

Usual approach

If you have an HTML page with an easy-to-grab table, this should work:

require("XML")
x <- readHTMLTable(
    doc="http://www.someurl.com"
)

Otherwise you'll definitely need to use some XPath to get to the nodes that you're interested in.

This usually involves parsing the HTML code via htmlTreeParse() and getting to the respective nodes via getNodeSet() and the like:

x <- htmlTreeParse(
    file="http://www.someurl.com",
    isURL=TRUE,
    useInternalNodes=TRUE
)

res <- getNodeSet(x, <your-xpath-statement>)
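For example, if the quotes were sitting in a plain HTML table, an XPath like the following would pull out the rows (purely illustrative; the real expression depends on the page's actual markup):

# illustrative only -- adjust the XPath to the page's actual structure
res <- getNodeSet(x, "//table//tr")
rows <- sapply(res, xmlValue)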

Approach including JSON data

Parse the HTML code:

x <- htmlTreeParse(
    file="http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654",
    isURL=TRUE,
    useInternalNodes=TRUE
)

Retrieve the actual JSON data:

json <- getNodeSet(x, "//body/p")
json <- xmlValue(json[[1]])

Get rid of non-JSON components:

json <- gsub("historySearchHandler\\(", "", json, perl=TRUE)
json <- gsub("\\)$", "", json, perl=TRUE)

Parse the JSON data:

require("jsonlite")
fromJSON(json, simplifyVector=FALSE)

[[1]]
[[1]]$status
[1] 0

[[1]]$hq
[[1]]$hq[[1]]
[[1]]$hq[[1]][[1]]
[1] "2014-03-18"

[[1]]$hq[[1]][[2]]
[1] "7.76"

[...]

Now you need to bring that into a more data.frame-like shape (methods that come to mind are do.call(), rbind() and cbind()).
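A minimal sketch along those lines, assuming the parsed list has exactly the structure printed above (one element holding status and hq):

parsed <- fromJSON(json, simplifyVector=FALSE)
hq <- parsed[[1]]$hq

# one row per record; each record is a list of character strings
df <- do.call(rbind, lapply(hq, function(rec) {
    as.data.frame(t(unlist(rec)), stringsAsFactors=FALSE)
}))

# the first fields appear to be date, open and close; check the rest against the site
names(df)[1:3] <- c("date", "open", "close")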

Encoding

Sooner or later (rather sooner than later as we see in this very example), you'll be confronted with encoding issues (stuff like "ÀÛ¼Æ:").

You can play around with different encodings either directly when parsing the HTML code (argument encoding in htmlTreeParse()) or by modifying a character string's encoding via Encoding() afterwards. I wasn't able to get it all correct for your values, though. Encoding issues can be quite a pain.
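If you want to experiment, something along these lines is the usual starting point (the source encoding here is only a guess; GBK/GB2312 is common on Chinese sites):

# guessing the source encoding -- GBK/GB2312 is common on Chinese sites
x <- htmlTreeParse(
    file="http://q.stock.sohu.com/cn/000002/lshq.shtml",
    isURL=TRUE,
    useInternalNodes=TRUE,
    encoding="GBK"
)

# or convert individual strings afterwards
json <- iconv(json, from="GBK", to="UTF-8")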

General suggestion

I'd recommend choosing English-based examples (an English-language website, in this case) in the future, as otherwise you're greatly limiting the number of people who might be able to help you.

Upvotes: 1
