Ggicci

Reputation: 825

Java regex: why do these two regular expressions behave differently?

I have a Java string representing a div element:

String source = "<div class = \"ads\">\n" +
                "\t<dl style = \"font-size:14px; color:blue;\">\n" +
                "\t\t<li>\n" +
                "\t\t\t<a href = \"http://ggicci.blog.163.com\" target = \"_blank\">Ggicci's Blog</a>\n" +
                "\t\t</li>\n" +
                "\t</dl>\n" +
                "</div>\n";

which in HTML form is:

<div class = "ads">
    <dl style = "font-size:14px; color:blue;">
        <li>
            <a href = "http://ggicci.blog.163.com" target = "_blank">Ggicci's Blog</a>
        </li>
    </dl>
</div>

I wrote this regex to extract the dl element:

<dl[.\\s]*?>[.\\s]*?</div>

But it finds nothing, so I changed it to:

<dl(.|\\s)*?>(.|\\s)*?</div>

and then it works. So I tested it like this:

System.out.println(Pattern.matches("[.\\s]", "a"));    // false
System.out.println(Pattern.matches("[abc\\s]", "a")); // true

So why can't the '.' match 'a'?

Upvotes: 0

Views: 76

Answers (4)

Alan Moore

Reputation: 75222

When you include regexes in a post, it's a good idea to post them as you're actually using them--in this case, as Java string literals.

"[.\\s]" is a Java string literal representing the regex [.\s]; it matches a literal dot or a whitespace character. Your regex is not trying to match a backslash or an 's', as others have said, but the crucial factor is that . loses its special meaning inside a character class.

"(.|\\s)" is a Java string literal representing the regex (.|\s); it matches (anything but a line separator character OR any whitespace character). It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.

But no worries, all you really need to do is use DOTALL mode (also known as single-line mode), which enables . to match anything including line separator characters.

(?s)<dl\b[^>]*>.*?</dl>
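
As a rough illustration of the DOTALL pattern above applied from Java, here is a minimal sketch. The class name DlExtractor and the method firstDl are made up for this example; only Pattern, Matcher and the regex itself come from the discussion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DlExtractor {
    // (?s) switches on DOTALL, so . also matches line terminators.
    // The backslash in \b is doubled because this is a Java string literal.
    private static final Pattern DL = Pattern.compile("(?s)<dl\\b[^>]*>.*?</dl>");

    // Returns the first <dl>...</dl> element found, or null if there is none.
    // (Hypothetical helper, not part of any library.)
    static String firstDl(String html) {
        Matcher m = DL.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "<div class = \"ads\">\n" +
                "\t<dl style = \"font-size:14px; color:blue;\">\n" +
                "\t\t<li>\n" +
                "\t\t\t<a href = \"http://ggicci.blog.163.com\" target = \"_blank\">Ggicci's Blog</a>\n" +
                "\t\t</li>\n" +
                "\t</dl>\n" +
                "</div>\n";
        // Prints the dl element, including the line breaks inside it.
        System.out.println(firstDl(source));
    }
}
```

Note that Pattern.matches would not work here, since it requires the whole input to match; Matcher.find searches for the pattern anywhere in the string.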

Upvotes: 0

Orabîg

Reputation: 11992

The syntax [.\\s] makes no sense because, as Daniel said, the . just means "a dot" in this context.

Why can't you replace your [.\\s] with a much simpler . ?
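
The point that . is only a literal dot inside brackets can be checked directly. The sketch below uses made-up names (DotInClassDemo, dotOrWhitespace, bareDot) purely for illustration. It also shows that a bare . with default flags does not match a line break, which matters for the multi-line string in the question.

```java
import java.util.regex.Pattern;

class DotInClassDemo {
    // True if the whole input matches the regex [.\s] (a dot or a whitespace character).
    static boolean dotOrWhitespace(String input) {
        return Pattern.matches("[.\\s]", input);
    }

    // True if the whole input matches a bare . with default flags.
    static boolean bareDot(String input) {
        return Pattern.matches(".", input);
    }

    public static void main(String[] args) {
        System.out.println(dotOrWhitespace(".")); // true: literal dot
        System.out.println(dotOrWhitespace(" ")); // true: whitespace
        System.out.println(dotOrWhitespace("a")); // false: neither
        System.out.println(bareDot("a"));         // true: wildcard outside brackets
        System.out.println(bareDot("\n"));        // false: no DOTALL, so no line breaks
    }
}
```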

Upvotes: 0

morja

Reputation: 8560

+1 for above.

I would do:

<dl[^>]*>(.*?)</dl>

To match the content of dl
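
A minimal sketch of reading that capture group from Java follows. The names DlContent and inner are hypothetical. One assumption to flag: the string in the question spans several lines, so this sketch passes Pattern.DOTALL to let . cross line breaks; that flag is an addition, not part of the regex as posted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DlContent {
    // Group 1 captures everything between the opening and closing dl tags.
    // Pattern.DOTALL is an assumption added here so . can match line breaks.
    private static final Pattern DL_BODY =
            Pattern.compile("<dl[^>]*>(.*?)</dl>", Pattern.DOTALL);

    // Returns the text inside the first dl element, or null if none is found.
    static String inner(String html) {
        Matcher m = DL_BODY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<dl class=\"x\">\n<li>item</li>\n</dl>";
        // Prints the li line surrounded by the two line breaks that were captured.
        System.out.println(inner(html));
    }
}
```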

Upvotes: 0

Daniel Hilgarth

Reputation: 174309

Inside the square brackets, the characters are treated literally. [.\\s] means "Match a dot, or a backslash, or an s".


(.|\\s) is equivalent to ..


I think you really want the following regex:

<dl[^>]*>.*?</div>

Upvotes: 3
