Reputation: 95
I am using a .NET-compatible regex flavor, although I am actually working in EditPad Pro. I am reformatting a .pdf into a simple web page, but some of the text from the PDF does not display correctly. For instance, a string in bold should be followed on the same line by its description, but in many places the two ended up on separate lines, so the bold string is left alone, like this:
word
description of the word
What I want to achieve is:
word description of the word
Because it is an HTML file, I am dealing with the tags
</span> or <br/>
I need to select only the words that are alone, without touching the lines that are already fine.
So what I want to target are lines like this one:
<p><span class="font7" style="font-weight:bold;">text text text text </span></p>\r\n<p>
where "text" repeated four times stands for the bold text within the lines to target. But there are lines like this one that I want to avoid:
<p><span class="font7" style="font-weight:bold;">text text text text </span><span class="font7"> text text text <br/> text text text </span></p>\r\n<p>
What I have been trying is a regular expression in the JGsoft or .NET-compatible flavor, because I wanted to use lookahead (although that is not a requirement). This doesn't seem to work, and I am wondering why:
<p><span class="font7" style="font-weight:bold;">.+?(?:(?!.+?</span>.+?$)){2}</p>\r\n<p>
Here is another attempt, which didn't work either:
<p><span class="font7" style="font-weight:bold;">(?!.+(</span>).+\1)</p>\r\n<p>
I tried using the lookahead at the beginning of the string to match, but in the end I made so many attempts that I prefer to ask people like you, who will probably know how to solve this problem.
So in the end, what I want is to remove this part from the targeted lines:
</p>\r\n<p>
because those double paragraphs are not necessary, but only in those specific lines. After removing it, the result will look like this:
word description of the word
Please, if you can, provide a .NET-flavor or Perl-flavor regex that I can run in a text editor; any other suggestion would be welcome as well.
Greetings from Cuernavaca, Mexico. Sorry for my English, and thanks for any help.
Upvotes: 0
Views: 125
Reputation: 3020
If you split this up into smaller pieces, something like this could work:
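using System;
using System.Text.RegularExpressions;

// A bold-only line (to fix) and a line that already has its description (to leave alone).
var valid = "<p><span class=\"font7\" style=\"font-weight:bold;\">text text text text </span></p>\r\n<p>";
var invalid = "<p><span class=\"font7\" style=\"font-weight:bold;\">text text text text </span><span class=\"font7\"> text text text <br/> text text text </span></p>\r\n<p>";
var input = valid + invalid;
// Step 1: match each <p>...</p> block; (?!<p) skips the dangling <p> at the end of the first sample.
foreach (Match match in Regex.Matches (input, "<p>(?!<p)(.*?)</p>")) {
    var line = match.Groups [1].Value;
    Console.WriteLine ("MATCH: {0}", line);
    // Step 2: count the spans inside the block.
    var spans = Regex.Matches (line, "<span.*?>(.*?)</span>");
    Console.WriteLine ("SPANS: {0}", spans.Count);
}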
So you'd first break things up by matching any <p>.....</p>, then check what's inside.
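If a single pass is preferred, the same idea can be sketched as one replacement. This is only a sketch under assumptions: the class name `ParagraphJoiner` and the method `Join` are made up for illustration, the bold span is assumed to carry `font-weight:bold` in its `style` attribute, and the span is assumed to contain plain text with no nested tags. The pattern requires `</p>` to come directly after the first `</span>`, so paragraphs that already contain a second span are left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphJoiner
{
    // Group 1: "<p>" plus a single bold span whose content has no nested tags.
    // The pattern then requires "</p>", a line break, and the next "<p>",
    // which are dropped by the replacement.
    private static readonly Regex LoneBoldSpan = new Regex(
        @"(<p><span\b[^>]*font-weight:bold[^>]*>[^<]*</span>)</p>\r?\n<p>");

    // Hypothetical helper: joins a lone bold paragraph with the paragraph after it.
    public static string Join(string html)
    {
        return LoneBoldSpan.Replace(html, "$1");
    }

    public static void Main()
    {
        var alone = "<p><span class=\"font7\" style=\"font-weight:bold;\">word </span></p>\r\n"
                  + "<p><span class=\"font7\">description of the word</span></p>";
        Console.WriteLine(Join(alone));
    }
}
```

Because the pattern uses only basic syntax (one capturing group, negated character classes), it should also work in an editor's regex search-and-replace with the first group as the replacement text, although that has not been checked against every flavor.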
Upvotes: 1