Jo zyd

Reputation: 79

Use rvest to extract html table

I am new to R and interested in using rvest to extract HTML tables and submit HTML forms.

I want to get some information from a Chinese website. The URL is:

http://caipiao.163.com/award/cqssc/20160513.html

I am using Windows 10 Professional with RStudio version 0.99.896, and Google Chrome as the web browser with the XPath Helper add-on.

I want to extract the main HTML table from the site. It contains 120 rows of lottery results. The first one (001) is 98446 and the last one (120) is 01798. I want to extract only the draw numbers (001) to (120) and the corresponding winning numbers, 98446 to 01798.

I used XPath Helper and Chrome's developer tools to get the XPath.

I think the XPATH for the information I want is:

//html/body/article[@class='docBody clearfix']/section[@id='mainArea']/div[@class='lottery-results']/table[@class='awardList']/*[@id="mainArea"]/div[1]/table/tbody/tr[2]/td[1]

But when I run the following code in RStudio, I cannot get the result I want. Here is my code:

> library(rvest)
Loading required package: xml2
> url <- "http://caipiao.163.com/award/cqssc/20160513.html"
> xp <- "//html/body/article[@class='docBody clearfix']/section    [@id='mainArea']/div[@class='lottery-results']/table[@class='awardList']/*[@id='mainArea']/div[1]/table/tbody/tr[2]/td[1]"
> 
> x <- read_html(url)
> y <- x %>% html_nodes(xpath=xp)
> y
{xml_nodeset (0)}

>

Please take a look at my code and let me know if I made any mistakes. You can ignore the Chinese characters; they are not important. I just want to get the numbers.

Thanks! John

Upvotes: 2

Views: 5078

Answers (2)

hrbrmstr

Reputation: 78832

It's not necessary to use such a precise target selector since there's only one table element (as the other answerer also pointed out). But you don't need to leave rvest behind:

library(rvest)

URL <- "http://caipiao.163.com/award/cqssc/20160513.html"

pg <- read_html(URL)
tab <- html_table(pg, fill=TRUE)[[1]]

str(tab)

## 'data.frame': 40 obs. of  39 variables:
##  $ 期号    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ 开奖号码: chr  "9 8 4 4 6" "1 8 3 1 6" "2 9 3 5 6" "1 4 5 8 0" ...
##  ....

(SO is interpreting some of the Unicode glyphs as spam, so I had to remove the other columns.)

The second column gets compressed via post-page-load JavaScript actions, so you'll need to clean that up a bit if that's the one you're going for.
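
As a rough illustration of that cleanup, here is a minimal sketch in base R. It does not touch the live page: `tab_sample` and its column names `period` and `winning` are hypothetical stand-ins for the first two columns shown in the `str(tab)` output, and the real table may need its columns selected by position instead.

```r
# Hypothetical sample shaped like the first two columns of str(tab) above.
# The column names here are made up for illustration.
tab_sample <- data.frame(
  period  = c(1L, 2L, 3L),
  winning = c("9 8 4 4 6", "1 8 3 1 6", "2 9 3 5 6"),
  stringsAsFactors = FALSE
)

# Keep only the digits, which drops the spaces between them.
tab_sample$winning <- gsub("[^0-9]", "", tab_sample$winning)

# Zero-pad the draw number to three characters, e.g. 1 becomes "001".
tab_sample$period <- sprintf("%03d", tab_sample$period)

print(tab_sample)
```

Stripping every non-digit character, rather than only plain spaces, is a design choice here: it should also cope with tabs or non-breaking spaces if the page happens to use them.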

Upvotes: 5

agustin

Reputation: 1351

I would use the function readHTMLTable from the XML package to get the whole table, since the page contains only one <table> element:

install.packages("XML")
library(XML)
url <- "http://caipiao.163.com/award/cqssc/20160513.html"
lotteryResults <- as.data.frame(readHTMLTable(url)) 

Then you can clean the result: subset the columns you need and use rbind to stack them into a data.frame with 2 columns and 120 observations.
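
The subsetting and rbind step might look something like the sketch below. It assumes the scraped table places several (draw number, winning number) column pairs side by side; the toy data frame `wide`, its column names, and the values in the second and third pairs are all made up for illustration, so the real column positions would need to be checked against `lotteryResults`.

```r
# Hypothetical wide layout: three side-by-side (period, number) column pairs.
# Names and most values are invented; only the shape matters here.
wide <- data.frame(
  period_1 = c(1L, 2L),   number_1 = c("9 8 4 4 6", "1 8 3 1 6"),
  period_2 = c(41L, 42L), number_2 = c("1 1 1 1 1", "2 2 2 2 2"),
  period_3 = c(81L, 82L), number_3 = c("3 3 3 3 3", "0 1 7 9 8"),
  stringsAsFactors = FALSE
)

# Subset each pair and give the pieces identical column names so rbind works.
blocks <- lapply(1:3, function(i) {
  data.frame(
    period = wide[[paste0("period_", i)]],
    number = wide[[paste0("number_", i)]],
    stringsAsFactors = FALSE
  )
})

# Stack the pieces into one two-column data.frame.
long <- do.call(rbind, blocks)

# Keep only the digits of each winning number and sort by draw number.
long$number <- gsub("[^0-9]", "", long$number)
long <- long[order(long$period), ]

print(long)
```

With the real table, the same pattern applied to three blocks of 40 rows would give the 120 observations mentioned above.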

Upvotes: 0
