chhenning

Reputation: 2077

Programmatically extracting tables from HTML files with C/C++

I'm looking for better ways to extract tables from HTML files. Right now I use tidy ( http://tidy.sourceforge.net/ ) to convert an HTML file into XHTML, and then I parse the XML with rapidxml. While parsing, I look for <table>, <tr>, and <td> nodes and build my table data structures from them.
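The traversal described above could be sketched roughly as follows. This is only an illustration, not the code actually in use: `Node` is a hypothetical stand-in for whatever element type the XML parser exposes (with rapidxml it would be a node in the parsed document), and `Row`, `Table`, `collect_text`, `collect_rows`, and `extract_tables` are names made up for this example. It assumes the XHTML has already been parsed into a tree.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed XHTML element: a tag name,
// the element's own text, and its child elements.
struct Node {
    std::string name;
    std::string text;
    std::vector<Node> children;
};

using Row = std::vector<std::string>;
using Table = std::vector<Row>;

// Concatenate the text of a node and all of its descendants,
// so that markup inside a cell (e.g. <b>) does not hide its text.
std::string collect_text(const Node& n) {
    std::string out = n.text;
    for (const Node& c : n.children) out += collect_text(c);
    return out;
}

// Append the rows found under `n` to `table`. Wrappers such as
// <thead> or <tbody> are descended into; nested tables are not.
void collect_rows(const Node& n, Table& table) {
    for (const Node& c : n.children) {
        if (c.name == "tr") {
            Row row;
            for (const Node& cell : c.children)
                if (cell.name == "td" || cell.name == "th")
                    row.push_back(collect_text(cell));
            table.push_back(row);
        } else if (c.name != "table") {
            collect_rows(c, table);
        }
    }
}

// Collect every <table> in document order, including nested ones.
void extract_tables(const Node& n, std::vector<Table>& out) {
    if (n.name == "table") {
        Table t;
        collect_rows(n, t);
        out.push_back(t);
    }
    for (const Node& c : n.children) extract_tables(c, out);
}
```

Calling `extract_tables(root, tables)` on the document root fills `tables` with one `Table` per `<table>` element. Keeping the walk independent of the parser type would presumably make it easier to swap tidy or rapidxml for something else later.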

It works quite nicely, but I'm wondering whether there are better ways to accomplish this. Also, the tidy library seems to be an abandoned project.

Also, has anyone ever tried the "experimental" patch in the tidy source code?

Thanks, Christian

Upvotes: 3

Views: 2242

Answers (2)

Hamed Masafi

Reputation: 304

You can use htmlparser (https://github.com/HamedMasafi/htmlparser). This library can parse, read, and modify HTML and CSS.

For example, in your case, to read a table:


    html_parser html;
    html.set_text(html_text);
    auto table = html.query("#table_id").at(0);
    for (auto tr : table->childs()) {
        for (auto td : tr->childs()) {
            // Here you have a td node and are free to read or modify its data, e.g.:
            auto td_tag = dynamic_cast<html_tag*>(td);
            if (!td_tag)
                continue; // skip children that are not tags
            td_tag->set_attr("id", "new_id"); // change an attribute
            auto id = td_tag->attr("id");
            auto text = td_tag->innser_text();
            auto outer = td_tag->outter_html(); // named `outer` so it does not shadow the parser object `html`
        }
    }

The quick start sample is here

Upvotes: 1

Jonas

Reputation: 131

I think your approach is quite OK. Tidying the HTML, converting it to XHTML, and then parsing the XML is probably the best way to do it; I cannot see how it could be simplified.

You did not mention any problems, so I am not sure what the issue is.

Upvotes: 0
