Clement

Reputation: 11

C++ Screen Scraping from HTML

I'm trying to extract the text "Lady Gaga Fame Monster" from the HTML below using substr and find, but I haven't been able to retrieve it.

<div class="album-name"><strong>Album</strong> > Lady Gaga Fame Monster</div>

I tried to extract the whole string first, but cout << line_found only prints up to Album; the whitespace after it seems to stop the extraction from going any further.

When I try cout << extract_line, I see no spaces in the extracted HTML.

I followed the tutorial at http://www.cplusplus.com/reference/string/string/substr/, and its example works, even with spaces. I'm following it closely, but my code stops extracting once it hits a space. Please help; it would be really appreciated. I've spent 2 days on this without finding a solution.

Here's the source code:

#include "parser.h"
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>

using namespace std;

int main() {

    string line_found, extract_line, result, finalResult="";
    int firstPosition, secondPosition, input, location;

    ifstream sourceFile ("cd1.htm"); // extracts from sourcefile

    while(!sourceFile.eof())
    {
        sourceFile >> extract_line;
        location = extract_line.find("album-name");
       // cout << extract_line;

       if (location >=0)
       {       
            line_found = extract_line.substr(location);
            cout << line_found << endl;
            firstPosition= line_found.find_first_of(">");

            result = line_found.substr(firstPosition);

       }
    }    
    return 0;
}

Upvotes: 1

Views: 3609

Answers (2)

obelix

Reputation: 986

Another lightweight and simple option could be to use a regex. VS2010 and VS2008 (SP1 IIRC) come with the <regex> header (#include <regex>), which should allow much more control and flexibility than your approach.

It wouldn't be as robust as Marcelo's approach but would be quicker to get started with.
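
As a rough sketch of what the regex route might look like, assuming a compiler with C++11 std::regex (on VS2008 SP1 the same classes would live in std::tr1). The function name extract_album and the exact pattern are made up for illustration, and the pattern also accepts &gt; as the separator on the assumption that the raw file may encode it that way:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text that follows the "Album >" label
// inside the album-name div, or an empty string if the pattern is not found.
std::string extract_album(const std::string& html) {
    // Tolerates optional whitespace around the tags and a literal '>' or
    // "&gt;" as the separator. The capture group is the album text.
    static const std::regex pattern(
        R"re(<div\s+class="album-name"\s*>\s*<strong>Album</strong>\s*(?:>|&gt;)\s*([^<]*?)\s*</div>)re");

    std::smatch match;
    if (std::regex_search(html, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

Called on the line from the question, extract_album should give back just the album text. It is still pattern matching rather than parsing, so any change to the markup (extra attributes, different nesting) would need a new pattern.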

Upvotes: 0

Marcelo Cantos

Reputation: 186118

The >> operator doesn't fetch lines. It fetches whitespace-separated tokens. Use std::getline (see here) instead.

Better still, don't use string searching tools to parse HTML. It's a disaster waiting to happen. In fact, it's happening to you right now. Note that there is more than one instance of > in your line, so you will probably find the wrong one and get yourself in a complete muddle trying to skip all the ones that don't matter. You could try looking for " > ", but what if you encounter this: ...class="album-name" > <strong>..., which is perfectly valid HTML?
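
To make the pitfall concrete, here is a small sketch that mirrors the find-then-substr logic from the question. The helper name from_first is made up for illustration:

```cpp
#include <string>

// Hypothetical helper mirroring the question's approach: returns everything
// from the first occurrence of `needle` onward, or an empty string if absent.
std::string from_first(const std::string& line, const std::string& needle) {
    const std::string::size_type pos = line.find(needle);
    return pos == std::string::npos ? std::string() : line.substr(pos);
}

// On the line from the question, from_first(line, ">") starts at the '>' that
// closes the opening div tag, not at the separator before the album title.
// Searching for " > " works on that exact line, but on the equally valid
// variant  <div class="album-name" > <strong>Album</strong> > ...  it matches
// right after the class attribute instead.
```

Either search only works as long as the markup happens to be spaced exactly the way the code expects.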

If the HTML is proper XHTML, use an XML parser instead. Expat, for instance, is small, fast and (relatively) simple to use. You can find a nice, easy intro here.

If the HTML is messy, you're going to struggle with C++. There's a related SO question here. Alternatively, use a language with a good HTML library such as Python (Beautiful Soup), which you can call from C++.

Upvotes: 6
