Reputation: 11
I'm trying to extract the text "Lady Gaga Fame Monster" from the HTML below using substr and find, but I haven't been able to retrieve it.
<div class="album-name"><strong>Album</strong> > Lady Gaga Fame Monster</div>
I tried to extract the whole string first, but cout << line_found only prints up to Album, as if the space after it stops the extraction from going any further.
When I try cout << extract_line instead, I see no spaces in the extracted HTML.
I followed the tutorial at http://www.cplusplus.com/reference/string/string/substr/, and its example works, even with spaces. I'm following it closely, but my code stops extracting as soon as it hits a space. Any help would be really appreciated, thanks. I've spent 2 days on this without finding a solution.
Here's the source code:
#include "parser.h"
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
using namespace std;
int main() {
string line_found, extract_line, result, finalResult="";
int firstPosition, secondPosition, input, location;
ifstream sourceFile ("cd1.htm"); // extracts from sourcefile
while(!sourceFile.eof())
{
sourceFile >> extract_line;
location = extract_line.find("album-name");
// cout << extract_line;
if (location >=0)
{
line_found = extract_line.substr(location);
cout << line_found << endl;
firstPosition= line_found.find_first_of(">");
result = line_found.substr(firstPosition);
}
}
return 0;
}
Upvotes: 1
Views: 3609
Reputation: 986
Another lightweight and simple option would be to use a regex. VS2010 and VS2008 (SP1, IIRC) come with the <regex> header (#include <regex>), which should allow much more control and flexibility than your approach.
It wouldn't be as robust as Marcelo's approach, but it would be quicker to get started with.
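As a rough sketch of the regex idea, assuming a C++11 compiler with std::regex (on VS2008 SP1 the same classes would, as far as I recall, live in std::tr1), something like the following could pull the album title out of a line. The function name extract_album and the pattern are made up for illustration, and the pattern only matches markup laid out exactly like the line in the question:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text after "<strong>...</strong> >"
// inside a <div class="album-name"> element, or an empty string if the
// line does not match. The pattern is tailored to the markup in the
// question and is not a general HTML parser.
std::string extract_album(const std::string& line) {
    static const std::regex pattern(
        R"re(<div class="album-name">\s*<strong>[^<]*</strong>\s*>\s*([^<]*?)\s*</div>)re");
    std::smatch match;
    if (std::regex_search(line, match, pattern)) {
        return match[1].str();  // first capture group: the album title
    }
    return "";
}
```

Calling extract_album on the line from the question should give back just the title. Any change in the markup (extra attributes, a different tag order) would need a different pattern, which is why this is less robust than a real parser.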
Upvotes: 0
Reputation: 186118
The >> operator doesn't fetch lines. It fetches whitespace-separated tokens. Use std::getline (see here) instead.
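To make the difference concrete, here is a minimal sketch. The helper names find_line_containing and first_token are hypothetical; only std::getline and operator>> come from the standard library:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the first line of `in` that contains
// `marker`, or an empty string if no line does. std::getline reads up
// to the newline, so spaces inside the line are preserved.
std::string find_line_containing(std::istream& in, const std::string& marker) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(marker) != std::string::npos) {
            return line;
        }
    }
    return "";
}

// Hypothetical helper showing what operator>> does instead: it skips
// leading whitespace and stops at the next whitespace character, so it
// yields only one token.
std::string first_token(const std::string& text) {
    std::istringstream in(text);
    std::string token;
    in >> token;
    return token;
}
```

With the line from the question, first_token stops at the space after <div, whereas find_line_containing hands back the whole line. Looping on the result of std::getline also avoids the while(!sourceFile.eof()) pattern, which only notices the end of the file after a read has already failed.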
Better still, don't use string searching tools to parse HTML. It's a disaster waiting to happen. In fact, it's happening to you right now. Note that there is more than one instance of > in your line, so you will probably find the wrong one and get yourself in a complete muddle trying to skip all the ones that don't matter (you could try looking for " > ", but what if you encounter this: ...class="album-name" > <strong>..., which is perfectly valid HTML?).
If the HTML is proper XHTML, use an XML parser instead. Expat, for instance, is small, fast and (relatively) simple to use. You can find a nice, easy intro here.
If the HTML is messy, you're going to struggle with C++. There's a related SO question here. Alternatively, use a language with a good HTML library such as Python (Beautiful Soup), which you can call from C++.
Upvotes: 6