Reputation: 341
Possibly the same problem as: R - Scraping an HTML table with rvest when there are missing <tr> tags
This is the data. The file has an .xls extension, but its content is HTML. I can read it with read_html, but not with read_xml.
With read_xml I get this error: 'Opening and ending tag mismatch: tbody line 41 and tr [76]'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html
xmlns="http://www.w3.org/1999/xhtml" lang="ko">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<meta http-equiv="Content-Style-Type" content="text/css">
<title></title>
<style type="text/css">
td {
text-align: center;
}
</style>
</head>
<body>
<table cellpadding="0" cellspacing="0" border="1" summary="summary">
<thead>
<tr>
<th>v1</th>
<th>v2</th>
<th>v3</th>
<th>v4</th>
<th>v5</th>
</tr>
</thead>
<tbody>
<td>aa1</td>
<td>aa2</td>
<td>aa3</td>
<td>aa4</td>
<td>aa5</td>
</tr>
</tbody>
</table>
</body>
</html>
My code and its result:
tmp<-read_html('file_name') %>% html_table()
> [[1]]
[1] V1 V2 V3 V4 V5
<0 rows> (or 0-length row.names).
Why can't it read the 'td' cells?
Desired output:
v1 v2 v3 v4 v5
aa1 aa2 aa3 aa4 aa5
Also, this code
tmp %>% html_nodes('table') %>% htmlTableWidget
renders the table correctly. But I need the result as a data frame, not as a widget.
Upvotes: 0
Views: 156
Reputation: 341
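Reference for valid use of the td element: https://developer.mozilla.org/ko/docs/Web/HTML/Element/td
As a workaround, I read the th and td cells directly and rebuild the data frame myself: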
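library(rvest)
# files.list (file paths) and fname (labels) are assumed to be defined earlier
for (i in seq_along(files.list)) {
  page <- read_html(files.list[i])
  # The header row is well formed, so take the column names from <th>
  col_name <- page %>%
    html_nodes("th") %>%
    html_text()
  # Read the <td> cells directly, because html_table() finds no rows here
  mydata <- page %>%
    html_nodes("td") %>%
    html_text()
  # Refill the cells row by row, one column per header cell
  finaldata <- data.frame(matrix(mydata, ncol = length(col_name), byrow = TRUE))
  names(finaldata) <- col_name
  print(i)
  print('---------')
  # Store each result under its own name, e.g. df_1<label>
  vname <- paste0('df_', i, fname[i])
  assign(vname, finaldata)
}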
Upvotes: 0
Reputation: 84465
The td cells in the tbody are not wrapped in a proper <tr></tr> pair (the opening <tr> is missing), and that is throwing off the HTML parser.
You would see this if using an html validation tool. E.g. with https://validator.w3.org/check
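The same thing can be checked from R. The sketch below is only an illustration: the inline `html` string is a shortened, hypothetical stand-in for the file in the question, and what the parser does with the stray cells may depend on the installed libxml2 version.

```r
library(rvest)
library(xml2)

# Shortened stand-in for the file in the question (same missing <tr>)
html <- '<table><thead><tr><th>v1</th><th>v2</th></tr></thead>
<tbody><td>aa1</td><td>aa2</td></tr></tbody></table>'
page <- read_html(html)

cells <- html_nodes(page, "td")
length(cells)                   # how many <td> cells the parser kept
xml_name(xml_parent(cells))     # which element the cells ended up under
length(html_nodes(page, "tr"))  # how many rows html_table() can walk over
```

If the cells are present but their parent is not a `tr`, that would explain why `html_table()` returns the header with zero rows.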
There are a variety of ways to address this. A simple way would be to use a package that fixes HTML, such as htmltidy.
library(rvest)
library(magrittr)
library(htmltidy)
page <- read_html('<file_path>')
htmltidy::tidy_html(page) %>% html_table()
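Another possible route, if htmltidy is not available, is to repair the raw text before parsing. This is a minimal sketch and not a general fix: it assumes that only the first body row lacks its opening `<tr>`, as in the sample file, and the inline `raw` string is a shortened, hypothetical stand-in for the file contents (for a real file you could build it with `paste(readLines(path), collapse = "\n")`).

```r
library(rvest)

# Shortened stand-in for the file contents (same missing <tr>)
raw <- '<table><thead><tr><th>v1</th><th>v2</th></tr></thead>
<tbody>
<td>aa1</td><td>aa2</td></tr></tbody></table>'

# Insert the missing <tr> where <tbody> is followed directly by a <td>
fixed <- gsub("(<tbody>\\s*)(<td)", "\\1<tr>\\2", raw, perl = TRUE)

# html_table() returns a list with one element per table
tbl <- html_table(read_html(fixed))[[1]]
tbl
```

The regular expression only touches a `<td>` that immediately follows `<tbody>`, so files whose rows are already well formed pass through unchanged.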
Upvotes: 1