Reputation: 1235
Let's say I have a few thousand HTML files with some text inside them (articles, actually). Each file also contains all sorts of scripts, styles, counters, and other junk somewhere above the actual text.
My task is to replace everything from the very beginning of the file up to a certain tag, i.e., we start with <head>
and end with <div class="StoryGoesBelow">
with a clean
<html>
<head>
</head>
<body>
block.
Is there a regex-based way to do this? In Vim, some other editor, or a scripting language?
Thanks.
Upvotes: 0
Views: 57
Reputation: 336108
The simplest regex for this would be (?s)\A.*?(?=<div class="StoryGoesBelow">)
(assuming you want to keep the <div>
tag). Replace the match with the clean <html><head></head><body> block from your question.
Explanation:
(?s) # Allow the dot to match newlines
\A # Anchor the search at the start of the string
.*? # Match any number of characters, as few as possible
(?=<div class="StoryGoesBelow">) # and stop right before this <div>
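As a minimal sketch of how this regex could be applied, here it is in Python, assuming Python's `re` module is acceptable (the question leaves the tool open). The names `CLEAN_HEADER`, `PREAMBLE`, and `strip_preamble` are made up for illustration and do not come from the question.

```python
import re

# The clean block from the question that should replace the old preamble.
CLEAN_HEADER = "<html>\n<head>\n</head>\n<body>\n"

# (?s) lets the dot match newlines, \A anchors at the start of the string,
# .*? matches as little as possible, and the lookahead stops right before
# the marker <div> without consuming it.
PREAMBLE = re.compile(r'(?s)\A.*?(?=<div class="StoryGoesBelow">)')


def strip_preamble(html: str) -> str:
    """Replace everything before the marker <div> with CLEAN_HEADER.

    If the marker is missing, the regex does not match and the input
    is returned unchanged.
    """
    # A function is used as the replacement so that the header text is
    # inserted literally, with no backslash or group-reference processing.
    return PREAMBLE.sub(lambda m: CLEAN_HEADER, html, count=1)


sample = (
    "<html><head><script>track()</script></head>\n"
    '<body><div class="StoryGoesBelow">Article text</div></body></html>'
)
print(strip_preamble(sample))
```

Because the lookahead does not consume the marker, the `<div>` itself stays in the output.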
This will fail, of course, if the text <div class="StoryGoesBelow">
also occurs in a comment or a string literal somewhere above the actual tag, because the lazy match stops at the first occurrence.
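Since the question is about thousands of files, here is a rough sketch of running the same replacement over a whole directory tree in Python. It assumes the files end in `.html` and are UTF-8 encoded, neither of which the question states. The function name `clean_tree` is hypothetical. Files without the marker are left untouched.

```python
import re
from pathlib import Path

# Clean block from the question; assumed to be the desired new preamble.
CLEAN_HEADER = "<html>\n<head>\n</head>\n<body>\n"

# Same regex as in the answer: everything from the start of the file
# up to (but not including) the marker <div>.
PREAMBLE = re.compile(r'(?s)\A.*?(?=<div class="StoryGoesBelow">)')


def clean_tree(root: str, encoding: str = "utf-8") -> int:
    """Rewrite every *.html file under root; return how many files changed."""
    changed = 0
    for path in Path(root).rglob("*.html"):
        text = path.read_text(encoding=encoding)
        # subn also reports how many substitutions were made (0 or 1 here).
        new_text, hits = PREAMBLE.subn(lambda m: CLEAN_HEADER, text, count=1)
        if hits and new_text != text:
            path.write_text(new_text, encoding=encoding)
            changed += 1
    return changed
```

A call such as `clean_tree("articles")` (the directory name is only an example) would process every matching file in place, so it is worth trying it on a copy of the data first.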
Upvotes: 1