younghyun

Reputation: 341

Why can't I read this HTML with the html_table function?

Possibly the same problem: R - Scraping an HTML table with rvest when there are missing <tr> tags

This is the data. The file has an .xls extension, but its content is the HTML shown below. I can read it with read_html, but not with read_xml.

When I use read_xml, I get this error: 'Opening and ending tag mismatch: tbody line 41 and tr [76]'

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html
    xmlns="http://www.w3.org/1999/xhtml" lang="ko">
    <head>
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
                <meta http-equiv="Content-Script-Type" content="text/javascript">
                    <meta http-equiv="Content-Style-Type" content="text/css">
                        <title></title>
                        <style type="text/css">
        td {
            text-align: center;
        }
    </style>
                    </head>
                    <body>
                        <table cellpadding="0" cellspacing="0" border="1" summary="summary">
                            <thead>
                                <tr>
                                    <th>v1</th>
                                    <th>v2</th>
                                    <th>v3</th>
                                    <th>v4</th>
                                    <th>v5</th>
                                </tr>
                            </thead>
                            <tbody>
                                <td>aa1</td>
                                <td>aa2</td>
                                <td>aa3</td>
                                <td>aa4</td>
                                <td>aa5</td>
                    </tr>

                            </tbody>
                        </table>
                    </body>
                </html>

My code and result:

tmp<-read_html('file_name') %>% html_table()
> [[1]]
[1] V1 V2 V3 V4 V5
<0 rows> (or 0-length row.names).

Why doesn't it read the td cells?

Desired output:

v1  v2  v3  v4  v5
aa1 aa2 aa3 aa4 aa5

Also,

tmp %>% html_nodes('table') %>% htmlTableWidget 

This code displays the table rows correctly, but I need the result as a data frame, not a widget.

Upvotes: 0

Views: 156

Answers (2)

younghyun

Reputation: 341

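Reference for checking the HTML markup (td element): https://developer.mozilla.org/ko/docs/Web/HTML/Element/td

I worked around the problem by extracting the th and td nodes directly and rebuilding the data frame from them: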

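library(rvest)

# files.list (paths to the files) and fname (a label for each file)
# are assumed to be defined earlier.
for (i in seq_along(files.list)) {
  page <- read_html(files.list[i])

  # Header cells give the column names
  col_name <- page %>%
    html_nodes("th") %>%
    html_text()

  # Select the td cells directly, so the missing <tr> does not matter
  mydata <- page %>%
    html_nodes("td") %>%
    html_text()

  # Fill the cells row by row, one column per header cell
  finaldata <- as.data.frame(
    matrix(mydata, ncol = length(col_name), byrow = TRUE,
           dimnames = list(NULL, col_name)),
    stringsAsFactors = FALSE
  )
  print(i)
  print('---------')

  # Store each result under its own variable name
  vname <- paste0('df_', i, fname[i])
  assign(vname, finaldata)
}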

Upvotes: 0

QHarr

Reputation: 84465

The td cells in the tbody are not wrapped in a proper <tr></tr> pair (the opening <tr> is missing), and that throws off the HTML parser.

You would see this with an HTML validation tool, e.g. https://validator.w3.org/check


There are various ways to address this. A simple one is to use a package that repairs HTML, such as htmltidy.

library(rvest)
library(magrittr)
library(htmltidy)

page <- read_html('<file_path>')

htmltidy::tidy_html(page) %>% html_table()
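
If installing htmltidy is not an option, another possible approach is to repair the markup as text before parsing. The following is only a minimal sketch, not a general fix: it assumes the only defect is a single missing opening <tr> right after <tbody>, as in the sample from the question, and it uses a shortened, hypothetical inline copy of that markup (raw_html) instead of a real file.

```r
library(rvest)

# Hypothetical, shortened copy of the malformed markup from the question:
# the body row has a closing </tr> but no opening <tr>.
raw_html <- '<table>
<thead><tr><th>v1</th><th>v2</th></tr></thead>
<tbody>
<td>aa1</td><td>aa2</td>
</tr>
</tbody>
</table>'

# Insert the missing opening <tr> after the first <tbody>.
# Assumption: one body row, one missing tag; other defects need other fixes.
fixed_html <- sub("<tbody>", "<tbody><tr>", raw_html, fixed = TRUE)

# Parse the repaired string; html_table() returns one table per <table>.
result <- read_html(fixed_html) %>% html_table()
print(result[[1]])
```

For a real file, the text could be read first with something like paste(readLines('<file_path>'), collapse = "\n") and passed through the same substitution.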

Upvotes: 1
