Reputation: 5110
I need to remove HTML tags from a string:
std::string whole_file("<imgxyz width=139\nheight=82 id=\"_x0000_i1034\" \n src=\"cid:[email protected]\" \nalign=baseline border=0> \ndfdsf");
When I use the RE2 library to remove the tags with this pattern:
RE2::GlobalReplace(&whole_file,"<.*?>"," ");
the HTML tags are not removed. When I use
RE2::GlobalReplace(&whole_file,"<.*\n.*\n.*?>"," ");
the HTML tags are removed. Why is that? Can anyone suggest a better regular expression to remove HTML tags from a file?
Upvotes: 1
Views: 2158
Reputation: 11558
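Try this pattern: <[^>]*>
The negated class [^>] matches any character except '>', newlines included, so it also handles tags that span several lines.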
Sample code:
#include <cstdio>
#include <regex>
#include <string>

int main()
{
    // Match '<', then any run of characters other than '>', then '>'.
    std::regex htmlCodes("<[^>]*>");
    std::cmatch matches;
    const char* nativeString = "asas<td cl<asas> ass=\"played\">0</td><td class=\"played\">";
    long offset = 0;
    // Search repeatedly, restarting just past the end of the previous match.
    while (std::regex_search(nativeString + offset, matches, htmlCodes))
    {
        // position(0) is relative to the search start, so add the offset back.
        const long position = static_cast<long>(matches.position(0)) + offset;
        const long length = static_cast<long>(matches.length(0));
        std::printf("Found: %s %ld %ld\n", matches[0].str().c_str(), position, length);
        offset = position + length;
    }
    return 0;
}
Output:
Found: <td cl<asas> 4 12
Found: </td> 31 5
Found: <td class="played"> 36 19
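To actually remove the tags rather than just list them, the same pattern can be handed to std::regex_replace. This is only a minimal sketch: strip_tags is a made-up helper name, and it uses std::regex instead of RE2. As far as I know the same pattern should also work with RE2::GlobalReplace(&whole_file, "<[^>]*>", " "), since a negated class matches newlines there by default, but I have not tested that here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replace every "<...>" span with a single space.
// [^>] is a negated character class, so it also matches '\n' and a tag
// may span several lines.
std::string strip_tags(const std::string& input)
{
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(input, tag, " ");
}
```

For example, strip_tags("a<b\nc>d") returns "a d". Note that this is plain pattern matching, not HTML parsing: a '>' inside an attribute value or a comment will end the match early.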
Upvotes: 0
Reputation: 299999
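Wild guess: .
does not match the EOL character, so "<.*?>" cannot match a tag that spans several lines.
You could use: "<(.|\n)*?>"
to match across any number of newline characters. (Inside a character class the dot is a literal dot, so "[.\n]" would match only dots and newlines.) RE2 also accepts the (?s) flag, as in "(?s)<.*?>", which makes . match newlines.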
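The dot-versus-newline behaviour is easy to check. The sketch below uses std::regex (ECMAScript grammar) as a stand-in for RE2, on the assumption that the two agree on this point by default; has_tag is a hypothetical helper name, not part of either library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
bool has_tag(const std::string& text, const std::string& pattern)
{
    return std::regex_search(text, std::regex(pattern));
}
```

With the two-line input "<a\nb>", has_tag returns false for "<.*?>" and for "<[.\n]*?>", but true for "<(.|\n)*?>" and for R"(<[\s\S]*?>)". On a single-line tag such as "<a>", the plain "<.*?>" matches fine.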
Upvotes: 2