Reputation: 79
I am new to R, and I am interested in using rvest to extract HTML tables and submit HTML forms.
Now, I want to get some useful information from a Chinese Website. The url is:
http://caipiao.163.com/award/cqssc/20160513.html
I am using Windows 10 Professional with RStudio Version 0.99.896, and Google Chrome as my web browser with the XPath Helper add-on.
I want to extract the main HTML table from the site. It contains 120 rows of lottery results: the first draw (001) is 98446 and the last (120) is 01798. I only want the draw numbers (001 to 120) and the winning numbers (98446 to 01798).
I used XPath Helper and the Chrome developer tools to get the XPath.
I think the XPath for the information I want is:
//html/body/article[@class='docBody clearfix']/section[@id='mainArea']/div[@class='lottery-results']/table[@class='awardList']/*[@id="mainArea"]/div[1]/table/tbody/tr[2]/td[1]
But when I run the following code in RStudio, I cannot get the result I want. Here is my code:
> library(rvest)
Loading required package: xml2
> url <- "http://caipiao.163.com/award/cqssc/20160513.html"
> xp <- "//html/body/article[@class='docBody clearfix']/section [@id='mainArea']/div[@class='lottery-results']/table[@class='awardList']/*[@id='mainArea']/div[1]/table/tbody/tr[2]/td[1]"
>
> x <- read_html(url)
> y <- x %>% html_nodes(xpath=xp)
> y
{xml_nodeset (0)}
>
Please take a look at my code and let me know if I made any mistakes. You can ignore the Chinese characters; they are not important. I just want the numbers.
Thanks! John
Upvotes: 2
Views: 5078
Reputation: 78832
It's not necessary to use such a precise target selector since there's only one table element (as the other answerer also pointed out). But you don't need to leave rvest behind:
library(rvest)
URL <- "http://caipiao.163.com/award/cqssc/20160513.html"
pg <- read_html(URL)
tab <- html_table(pg, fill=TRUE)[[1]]
str(tab)
## 'data.frame': 40 obs. of 39 variables:
## $ 期号 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ 开奖号码: chr "9 8 4 4 6" "1 8 3 1 6" "2 9 3 5 6" "1 4 5 8 0" ...
## ....
(SO is interpreting some of the Unicode glyphs as spam, so I had to remove the other columns.)
The second column gets compressed via post-page-load JavaScript actions, so you'll need to clean it up a bit if that's the one you're after.
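A minimal sketch of that cleanup, assuming the winning numbers arrive as space-separated digits like those in the str() output above. The vector name raw_numbers is hypothetical, and its values are copied from that output rather than fetched from the page.

```r
# Hypothetical stand-in for the second column of the scraped table;
# the values are the ones shown in the str() output.
raw_numbers <- c("9 8 4 4 6", "1 8 3 1 6", "2 9 3 5 6", "1 4 5 8 0")

# Strip all whitespace so each entry becomes a five-digit string.
clean_numbers <- gsub("\\s+", "", raw_numbers)

print(clean_numbers)
```

Keeping the result as character rather than converting to integer preserves leading zeros, which matters for a number such as 01798.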
Upvotes: 5
Reputation: 1351
I would use the function readHTMLTable from package XML to get the whole table, as in your website there is only one <table> element:
install.packages("XML")
library(XML)
url <- "http://caipiao.163.com/award/cqssc/20160513.html"
lotteryResults <- as.data.frame(readHTMLTable(url))
Then you can do some cleanup: subset the columns you need and use rbind to stack them into a data.frame with 2 columns and 120 observations.
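The subsetting and rbind step might look something like the sketch below. It runs on a toy data frame, not the real page: the layout (three side-by-side blocks, each holding a draw number and a winning number), the column names, and the starts positions are all assumptions, and every value other than the first two winning numbers is a made-up placeholder. Check str(lotteryResults) to find the real column positions.

```r
# Toy stand-in for the wide scraped table (assumed layout, placeholder values).
wide <- data.frame(
  id_a = c(1, 2),   num_a = c("9 8 4 4 6", "1 8 3 1 6"),
  id_b = c(41, 42), num_b = c("0 0 0 0 1", "0 0 0 0 2"),
  id_c = c(81, 82), num_c = c("0 0 0 0 3", "0 0 0 0 4"),
  stringsAsFactors = FALSE
)

# Assumed position of the first column of each block.
starts <- c(1, 3, 5)

# Pull each block out as a two-column data frame with shared names,
# because rbind requires matching column names.
blocks <- lapply(starts, function(i) {
  setNames(wide[, c(i, i + 1)], c("draw", "number"))
})

# Stack the blocks, strip spaces from the numbers, and sort by draw.
long <- do.call(rbind, blocks)
long$number <- gsub("\\s+", "", long$number)
long <- long[order(long$draw), ]
rownames(long) <- NULL

print(long)
```

With the real table, the same pattern would give 120 rows if each of the three blocks holds 40 draws.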
Upvotes: 0