alex

Reputation: 95

Matching a string that does not have the HTML span tag twice on the same line

I am using a .NET-compatible regex flavor, although I am actually working in EditPad Pro. I am reformatting a .pdf into a simple web page, but some text from the PDF is not displaying correctly. For instance, a string in black (bold) font should be followed on the same line by its description. In many places the two are not together, so the bold string stands alone, like this:

word

description of the word

and what I want to achieve is:

word description of the word

Because it is an HTML file, I am dealing with the tags

</span> or <br/>

I need to select only the words that stand alone, without interfering with the lines that are already fine.

So what I want to target are lines like this one:

<p><span class="font7" style="font-weight:bold;">text text text text </span></p>\r\n<p>

where "text" repeated 4 times is the bold text within the lines to target. But there are also lines like this one, which I want to avoid:

<p><span class="font7" style="font-weight:bold;">text text text text </span><span class="font7"> text text text <br/> text text text </span></p>\r\n<p>

I have been trying a regular expression in the JGsoft or .NET-compatible flavor, using lookahead (although that is not a requirement), but it doesn't seem to work, and I am wondering why:

<p><span class="font7" style="font-weight:bold;">.+?(?:(?!.+?</span>.+?$)){2}</p>\r\n<p>

Here is another attempt, which didn't work either:

<p><span class="font7" style="font-weight:bold;">(?!.+(</span>).+\1)</p>\r\n<p>

I tried using the lookahead at the beginning of the string to match, but in the end I made so many attempts that I prefer to ask people who will probably know how to solve this problem.

So in the end, what I want is to remove this part from the targeted lines:

</p>\r\n<p>

because those double paragraphs are not necessary, but only in those specific lines. After removing it, the result will look like this:

word description of the word

If you can, please provide a .NET-flavor or Perl-flavor expression that I can use in a text editor; any other suggestion would be welcome as well.

Greetings from Cuernavaca, Mexico. Sorry for my English, and thanks for any help.

Upvotes: 0

Views: 125

Answers (1)

Martin Baulig

Reputation: 3020

If you split this up into smaller pieces, something like this could work:

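using System;
using System.Text.RegularExpressions;

// Sample lines from the question: "valid" is a line to target, "invalid" is one to leave alone.
var valid = "<p><span class=\"font7\" style=\"font-weight:bold;\">text text text text </span></p>\r\n<p>";
var invalid = "<p><span class=\"font7\" style=\"font-weight:bold;\">text text text text </span><span class=\"font7\"> text text text <br/> text text text </span></p>\r\n<p>";
var input = valid + invalid;

// Step 1: match each <p>...</p>; (?!<p) skips a <p> that is immediately followed by another <p>.
foreach (Match match in Regex.Matches (input, "<p>(?!<p)(.*?)</p>")) {
    var line = match.Groups [1].Value;
    Console.WriteLine ("MATCH: {0}", line);

    // Step 2: count the spans inside the paragraph (1 for the valid line, 2 for the invalid one).
    var spans = Regex.Matches (line, "<span.*?>(.*?)</span>");
    Console.WriteLine ("SPANS: {0}", spans.Count);
}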

So you'd first break things up by matching each <p>.....</p>, then check what's inside: a paragraph that contains exactly one span is a line to change, and one with two spans is left alone.
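If you would rather do the removal in one pass, the span check can be folded into the pattern itself. The following is only a sketch, not something tested against your real file: the class name LoneBoldSpan and the method Merge are made up for illustration, and it assumes each target paragraph sits on one line and holds exactly one span whose attributes contain font-weight:bold, as in your samples.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are illustrative, not from the question).
public static class LoneBoldSpan
{
    // Group 1: "<p>" followed by a single bold span.
    //   <span[^>]*font-weight:bold[^>]*>  the opening tag must mention font-weight:bold (assumption)
    //   (?:(?!</?span\b).)*               any text that does not start another <span or </span
    // After the group: the "</p>", the line break and the "<p>" that should be removed.
    private static readonly Regex Target = new Regex(
        @"(<p><span[^>]*font-weight:bold[^>]*>(?:(?!</?span\b).)*</span>)</p>\r?\n<p>");

    // Keeps group 1 and drops the "</p>\r\n<p>" that follows a lone bold span.
    public static string Merge(string html)
    {
        return Target.Replace(html, "$1");
    }
}
```

Calling LoneBoldSpan.Merge on the first sample line leaves `<p><span class="font7" style="font-weight:bold;">text text text text </span>`, so the description from the next line follows the bold text directly. The second sample line is returned unchanged, because after its first </span> the pattern requires </p> but finds another <span. The same pattern should also work as a search expression in an editor with a .NET-compatible flavor, with the first capturing group as the replacement text. One caveat: when two lone bold paragraphs follow each other, the match for the first one consumes the <p> of the second, so the second is not merged in the same pass.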

Upvotes: 1
