Matching a string that does not have the HTML tag span twice on the same line

Question

I am using a .NET-compatible regex flavor, although I am actually working in EditPad Pro. I am reformatting a .pdf into a simple web page, but some text from the PDF file does not display correctly. For instance, a string in black font should be followed on the same line by its description, but in many places the two are split apart, so the black-font string sits alone. Let's say:

word

description of the word

and what I want to achieve is:

word description of the word

Because it is an HTML file, I am dealing with the tag

or

I need to select only those words that are alone, without interfering with those that are already fine.

So what I want to target are lines like this one:

text text text text

where "text" repeated 4 times is black-font text within the lines to target. But there are lines like this one that I want to avoid:

text text text text text text text text text text

What I have been trying is a regular expression in the JGsoft or .NET-compatible flavor, using lookahead (although that is not a requirement). This one doesn't seem to work, and I am wondering why:

.+?(?:(?!.+?.+?$)){2}

Here is another attempt, which didn't work either:

(?!.+().+\1)

I tried using the lookahead at the beginning of the string to match, but in the end I made so many attempts that I prefer to ask people like you, who will probably know how to solve this problem.

So in the end what I want is to remove this part from the lines to target,

because those double paragraphs are not necessary, but only in those specific lines. After doing that, the result will look like this:

word description of the word

Please, if you can, provide a .NET-flavor or Perl-flavor expression that I can use in a text editor; any other suggestion would be welcome as well.

Greetings from Cuernavaca, Mexico. Sorry for my English, and thanks for any help.

Martin Baulig · Accepted Answer

If you split this up into smaller pieces, something like this could work:

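// Assumption: paragraphs are <p>...</p> and the black text is wrapped in <span>...</span>.
var valid = "<p><span>text text text text</span></p>\n";
var invalid = "<p><span>text text text text</span> text text text <span>text text text</span></p>\n";
var input = valid + invalid;

// Step 1: capture the body of every paragraph.
foreach (Match match in Regex.Matches (input, "<p>(.*?)</p>")) {
    var line = match.Groups [1].Value;
    Console.WriteLine ("MATCH: {0}", line);

    // Step 2: count the spans inside that paragraph.
    var spans = Regex.Matches (line, "<span>(.*?)</span>");
    Console.WriteLine ("SPANS: {0}", spans.Count);
}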

So you'd first break things up by matching any ....., then check what's inside.
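To go from counting to actually joining the lines, the same two-step idea could be sketched roughly like this. It is only a sketch under assumptions: it supposes each line is a <p>...</p> paragraph and the black text is wrapped in <span ...>...</span>, and the names LoneSpanMerger, IsLoneSpan and Merge are made up for illustration. The first step matches a paragraph body together with the paragraph break that follows it; the second step checks whether that body is exactly one span, and only then drops the break.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoneSpanMerger
{
    // Assumed markup: <p>...</p> paragraphs, black text inside <span ...>...</span>.
    // Step 1: a paragraph body (the text right after "<p>") plus the
    // "</p> <p>" break that separates it from the next paragraph.
    static readonly Regex BodyAndBreak = new Regex(
        @"(?<=<p>)((?:(?!</p>).)*)</p>\s*<p>", RegexOptions.Singleline);

    // Any span element, with or without attributes (nested spans are not handled).
    static readonly Regex Span = new Regex(
        @"<span\b[^>]*>.*?</span>", RegexOptions.Singleline);

    // Step 2: true when the body is exactly one span and nothing else.
    public static bool IsLoneSpan(string body)
    {
        var spans = Span.Matches(body);
        return spans.Count == 1 && spans[0].Value == body.Trim();
    }

    // Drops the paragraph break after every lone-span paragraph,
    // leaving all other paragraphs exactly as they were.
    public static string Merge(string html)
    {
        return BodyAndBreak.Replace(html, m =>
            IsLoneSpan(m.Groups[1].Value)
                ? m.Groups[1].Value.Trim() + " "
                : m.Value);
    }

    public static void Main()
    {
        var html = "<p><span>word</span></p>\n<p>description of the word</p>\n" +
                   "<p><span>word</span> description of the word</p>";
        Console.WriteLine(Merge(html));
    }
}
```

With the sample in Main, the first two paragraphs should come out joined as <p><span>word</span> description of the word</p>, while the third, which already has its description, is left alone. The lookbehind (?<=<p>) keeps the opening tag out of the match, so a paragraph that is not changed does not swallow the start of the next one.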
