Reputation: 13
I'm developing an ASP code that read a external websites and parse it via HTMLDocument interface Object ( "HTMLFILE" Object) to navigate contents via DOM structure. But there are some pages that throw an error :
'htmlfile error 80070057 Invalid Argument.'
After doing a lot of research, I've discovered that there are some HTML tags that, i don't know why, are not rendered or managed correctly by HTMLFILE object giving me that error.
Because ASP is too old and there isn't much content available today to be probing, I'm convinced that I have to parse it before send to HTMLFILE Object, and the best way that I have figured is to do via RegEx.
But I'm facing some problems (and because i don't have much practice).
I have to successfully locate HTML Tag Blocks that 'HTMLFILE' do not accept to be able to remove them.
For Example:
<head>
<script> ....... </script>
<style> ....... </style>
</head>
<body>
<iframe> ........ </iframe>
<div> ..... </div>
<table>.....</table>
I have to match full script block, style and iframe, leaving the rest of document intact.
From last days i've doing some research and have almost done it:
<(?:script|embed|object|frameset|frame|iframe|meta|style).+(.|\s)*?>$
I've tried to match single line tag (for example '<BR>') but I'm totally confused now and there are some inconsistencies on it, for example, some of lines that close some tags are improperly selected.
I Know that the best way is discover why HTMLFILE is throwing me on error, but there is no more information on error to debug it.
Thank for all the time and patience.
Upvotes: 1
Views: 2370
Reputation: 5395
Regex is not best tool for that, take a look here, but if you really want, I think in simple cases you can also use one regex for all tags, also nested:
(?=(<([^>]+)>([\s\S]*?)<\/\2>))
the 1st groups shows whole captured part, 2nd groups capture just tag, and 3rd group capture content of tag. It doesn't actually match text, only capture some fragments. However you probably can get start/end index of match, and use in as you want.
Still I think you should reconsider using regex, however suntex used above is quite useful, so it is worth to know how to use it.
Upvotes: 0