Loïc Février

Reputation: 7750

MySQL REGEXP for detecting long lines

I have some records in my database that look like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.......
<PRE>
one short line
an other short line
a very long line I want to detect with more than 80 caracterssssssssssssssssss
again some short lines
</PRE>
Nullam tristique nisl eu lacus fringilla porta. ........

I would like to detect long lines (more than 80 characters) inside the PRE tags so that I can then edit them manually.

I tried something like this:

SELECT * FROM table WHERE column 
    REGEXP "<PRE>.*[\n\r]+[^\n\r]{80,}[\n\r]+.*</PRE>"

but it also returns records that contain no long lines.

Can someone point me in the right direction?

Upvotes: 0

Views: 336

Answers (4)

Alan Moore

Reputation: 75242

The [^\n\r]{80,} isn't necessarily matching a line in the PRE element where it starts searching. The .* could be matching the closing </PRE> tag and beyond, so the long line could be in another PRE element if there is one, or even in the text between PRE elements.
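To make that failure mode concrete, here is a small sketch in Python rather than MySQL. It assumes that Python's `re` module with `re.DOTALL` is a close enough stand-in for a MySQL `REGEXP` in which `.` matches newlines, and the sample record is made up for illustration.

```python
import re

# Pattern from the question. re.DOTALL lets "." cross newlines, which is
# an assumption meant to mimic how MySQL's REGEXP treats ".".
ORIGINAL = re.compile(r"<PRE>.*[\n\r]+[^\n\r]{80,}[\n\r]+.*</PRE>", re.DOTALL)

# Hypothetical record: both PRE blocks contain only short lines, but a
# 90-character line sits in the prose between them.
record = (
    "intro\n"
    "<PRE>\nshort line\n</PRE>\n"
    + "x" * 90 + "\n"
    + "<PRE>\nanother short line\n</PRE>\n"
)

match = ORIGINAL.search(record)
false_positive = match is not None
print(false_positive)  # True: the record is reported despite its short PRE lines
```

The match starts at the first `<PRE>` and ends at the last `</PRE>`, so the long line it found is the one between the two elements.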

I don't think there's a bullet-proof way to do what you want in MySQL, but you could try this:

<PRE>[^<]*[\n\r][^\n\r<]{80,}

You've said there won't be any other markup inside the PRE element, so any angle bracket in its content should be in the form of an escape sequence like &lt;, and the first < the regex encounters should be one in the </PRE> tag.

It's a hack, but without lookaheads, this is the only way I can think of to constrain the match to within the same PRE element. To do this job right, you should do it outside MySQL altogether.
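As a rough idea of what "outside MySQL" could look like, here is a minimal Python sketch. It assumes the column text has already been fetched into a string; the function name `long_pre_lines` and the `limit` parameter are hypothetical, and the check uses "longer than 80" as stated in the question rather than the `{80,}` (80 or more) of the regex.

```python
import re

# Matches one PRE element at a time. The lazy ".*?" stops at the first
# closing tag, so a match never spans two elements.
PRE_BLOCK = re.compile(r"<PRE>(.*?)</PRE>", re.DOTALL)


def long_pre_lines(text, limit=80):
    """Return the lines inside PRE elements that are longer than `limit`."""
    found = []
    for block in PRE_BLOCK.findall(text):
        # splitlines() copes with \n, \r\n and \r line endings.
        for line in block.splitlines():
            if len(line) > limit:
                found.append(line)
    return found


# Hypothetical record: one long line inside a PRE element and one outside.
sample = (
    "Lorem ipsum\n"
    "<PRE>\nshort\n" + "y" * 81 + "\n</PRE>\n"
    + "z" * 100 + "\n"
    + "<PRE>\nok\n</PRE>"
)
print(long_pre_lines(sample) == ["y" * 81])  # True
```

Because each PRE element is extracted first and its lines are measured separately, long lines in the surrounding prose are ignored without any regex trickery.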

Upvotes: 1

Phrogz

Reputation: 303361

<PRE>\s*[^\n\r]{80,}.*?</PRE>

Note that this assumes that the </PRE> tag never comes on the same line as the content. (If it did, you could consume 74 characters of 'long line' followed by the closing tag, and then you would consume a lot of content up until the next closing tag.)

Upvotes: 0

cababunga

Reputation: 3114

If there could be more than one <PRE> block, your expression can swallow the space in between them. Change [^\n\r]{80,} to [^\n\r]{80,}?.

Upvotes: 0

bcosca

Reputation: 17555

Use .*? instead of .* so that the quantifier isn't greedy.

Upvotes: 1
