Reputation: 2077
I'm looking for better ideas for extracting tables from HTML files. Right now I'm using tidy ( http://tidy.sourceforge.net/ ) to convert an HTML file into XHTML, and then I use rapidxml to parse the XML. While parsing, I look for <table>, <tr>, and <td> nodes and build my table data structures from them.
It works quite nicely, but I'm wondering if there are better ways to accomplish my task. Also, the tidy lib seems like an abandoned project.
Also, has anyone ever tried the "experimental" patch in the tidy source code?
Thanks, Christian
Upvotes: 3
Views: 2242
Reputation: 304
You can use htmlparser (https://github.com/HamedMasafi/htmlparser). This lib can parse, read, and modify HTML and CSS.
For example, to read a table in your case:
html_parser html;
html.set_text(html_text);
auto table = html.query("#table_id").at(0);
for (auto tr : table->childs()) {
    for (auto td : tr->childs()) {
        // Here you have a td node; you are free to read or modify its data, e.g.:
        auto td_tag = dynamic_cast<html_tag*>(td);
        if (!td_tag) continue; // skip children that are not tags (e.g. text nodes)
        td_tag->set_attr("id", "new_id"); // change an attribute
        auto id = td_tag->attr("id");
        auto test = td_tag->innser_text();
        auto cell_html = td_tag->outter_html(); // renamed to avoid shadowing the parser object
    }
}
The quick start sample is here
Upvotes: 1
Reputation: 131
I think your approach is quite OK. In my opinion, tidying the HTML into XHTML and then parsing it as XML is the best option; I cannot see how it could be simplified.
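To illustrate the parsing step, here is a minimal sketch of the table-building loop. It assumes tidy has already produced well-formed, lowercase XHTML, and it uses a naive standard-library tag scanner in place of rapidxml so that it compiles on its own. `Table` and `extract_table` are hypothetical names; nested tables, <th> cells, comments, and entities are not handled, and cell text is not trimmed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One inner vector per <tr>, one string per <td>.
using Table = std::vector<std::vector<std::string>>;

// Hypothetical helper: collects the text of every <td>, grouped by <tr>.
// Assumes well-formed, lowercase XHTML (for example, tidy output).
Table extract_table(const std::string& xhtml) {
    Table rows;
    bool in_cell = false;
    std::size_t pos = 0;
    while (pos < xhtml.size()) {
        if (xhtml[pos] != '<') {
            // Text between tags: keep it only while inside a <td>.
            std::size_t next = xhtml.find('<', pos);
            if (next == std::string::npos) next = xhtml.size();
            if (in_cell) rows.back().back().append(xhtml, pos, next - pos);
            pos = next;
            continue;
        }
        const std::size_t close = xhtml.find('>', pos);
        if (close == std::string::npos) break;  // unterminated tag: stop scanning

        // Split "<name ...>" or "</name>" into a closing flag and a tag name.
        std::size_t start = pos + 1;
        const bool closing = (xhtml[start] == '/');
        if (closing) ++start;
        const std::size_t name_end = xhtml.find_first_of(" \t\r\n/>", start);
        const std::string name = xhtml.substr(start, name_end - start);

        if (!closing && name == "tr") {
            rows.emplace_back();  // start a new row
            in_cell = false;
        } else if (!closing && name == "td") {
            if (rows.empty()) rows.emplace_back();  // tolerate a <td> without a <tr>
            rows.back().emplace_back();             // start a new, empty cell
            in_cell = (xhtml[close - 1] != '/');    // <td/> opens and closes at once
        } else if (closing && name == "td") {
            in_cell = false;
        }
        // Any other tag (e.g. <b> inside a cell) is skipped; its text is still kept.
        pos = close + 1;
    }
    return rows;
}
```

Calling `extract_table(tidied_xhtml)` returns the rows in document order, so `t[r][c]` is the text of cell `c` in row `r`. With rapidxml the same loop would walk child nodes instead of scanning characters, but the resulting data structure can stay the same.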
You did not mention any problems so I am not sure what the issue is.
Upvotes: 0