Reputation: 33
I want to scrape this website, and get the data from the table.
I use GET
from the package httr
, code is like below:
url <- 'http://datacenter.mep.gov.cn/report/water/water.jsp?'
year <- 2016
wissue <- 2
res <- GET(url,
query = list(year = year,
wissue = wissue))
resC <- content(res, as = 'text', encoding = 'utf-8')
But what I got is not a json string but something very strange like below:
"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<html>\r\n\t<head>\r\n\t\t<title>中华人民共和国环境保护部--政府网站数据中心</title>\r\n\t\t<meta http-equiv=\"content-type\" content=\"text/html;
I wonder is there anyway to parse this format?
Upvotes: 3
Views: 1629
Reputation: 78792
The rowspan
attribute is going to make dealing with this table pretty interesting. You have a few choices, two of which are:
html_table()
on the target <table>
using fill=TRUE
and perform surgery on the resultant data frame<tr>
-level and build the data frame from the ground upThis answer does the latter.
library(rvest)
library(purrr)
First, we get the content in a form we can perform XML/HTML surgery on:
content(res, as = 'text', encoding = 'utf-8') %>%
read_html() -> pg
Next, we target and extract the table node with the report:
tab <- html_nodes(pg, "table#report1")
Here's te tricky bit. We first target all the <tr>
elements that have @rowspan
attributes but no <td>
elements with a @colspan
attribute:
html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>%
Next, we process those invidivually:
map_df(function(x) {
We get the # of rows the <tr>
spans:
html_nodes(x, xpath=".//td[@rowspan]") %>%
html_attr("rowspan") %>%
as.numeric() -> row_ct
Find all the sibling <tr>
elements and reduce the set to the remaining ones in this <tr>
"block":
rows <- html_nodes(x, xpath=".//following-sibling::tr")
rows <- rows[1:(row_ct-1)]
Make a data frame from that first block row
html_nodes(x, xpath=".//td") %>%
html_text() %>%
setNames(sprintf("X%d", 1:13)) %>%
as.list() %>%
flatten_df() -> first
Go through all filtered sibling rows and do the same, leaving room to fill in the spanned column:
map_df(rows, ~html_nodes(., xpath=".//td") %>%
html_text() %>%
setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>%
as.list()) %>%
mutate(X3=first$X3) %>%
select(X1, X2, X3, everything()) -> rest
bind_rows(first, rest)
}) -> h2o_df
dplyr::glimpse(h2o_df)
I can't paste the output of that since SO's javascript text filter is so brain dead it thinks that the post is spam just b/c it has kanji characters.
Here's all the code in a contiguous chunk:
tab <- html_nodes(pg, "table#report1")
html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>%
map_df(function(x) {
html_nodes(x, xpath=".//td[@rowspan]") %>%
html_attr("rowspan") %>%
as.numeric() -> row_ct
rows <- html_nodes(x, xpath=".//following-sibling::tr")
rows <- rows[1:(row_ct-1)]
html_nodes(x, xpath=".//td") %>%
html_text() %>%
setNames(sprintf("X%d", 1:13)) %>%
as.list() %>%
flatten_df() -> first
map_df(rows, ~html_nodes(., xpath=".//td") %>%
html_text() %>%
setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>%
as.list()) %>%
mutate(X3=first$X3) %>%
select(X1, X2, X3, everything()) -> rest
bind_rows(first, rest)
}) -> h2o_df
Upvotes: 2