Reputation: 825
I have a Java string representing a div element:
String source = "<div class = \"ads\">\n" +
"\t<dl style = \"font-size:14px; color:blue;\">\n" +
"\t\t<li>\n" +
"\t\t\t<a href = \"http://ggicci.blog.163.com\" target = \"_blank\">Ggicci's Blog</a>\n" +
"\t\t</li>\n" +
"\t</dl>\n" +
"</div>\n";
which, written out as HTML, is:
<div class = "ads">
<dl style = "font-size:14px; color:blue;">
<li>
<a href = "http://ggicci.blog.163.com" target = "_blank">Ggicci's Blog</a>
</li>
</dl>
</div>
I wrote this regex to extract the dl element:
<dl[.\\s]*?>[.\\s]*?</div>
But it finds nothing, so I changed it to:
<dl(.|\\s)*?>(.|\\s)*?</div>
and then it works. So I ran this test:
System.out.println(Pattern.matches("[.\\s]", "a"));    // false
System.out.println(Pattern.matches("[abc\\s]", "a"));  // true
So why can't the '.' match 'a'?
Upvotes: 0
Views: 76
Reputation: 75222
When you include regexes in a post, it's a good idea to post them as you're actually using them--in this case, as Java string literals.
"[.\\s]" is a Java string literal representing the regex [.\s]; it matches a literal dot or a whitespace character. Contrary to what others have said, your regex is not trying to match a backslash or an 's'; the crucial factor is that . loses its special meaning inside a character class.
"(.|\\s)" is a Java string literal representing the regex (.|\s); it matches anything but a line separator character, or any whitespace character. It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.
But no worries: all you really need to do is use DOTALL mode (also known as single-line mode), which enables . to match anything, including line separator characters.
(?s)<dl\b[^>]*>.*?</dl>
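For illustration, here is a minimal sketch of that DOTALL regex written as a Java string literal and run against the string from the question. The class name ExtractDl and the helper firstDl are made up for this example; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractDl {
    // (?s) turns on DOTALL, so '.' also matches line separators.
    // Note the doubled backslash: "\\b" in Java source is the regex \b.
    static final Pattern DL = Pattern.compile("(?s)<dl\\b[^>]*>.*?</dl>");

    // Hypothetical helper: returns the first <dl>...</dl> element, or null if none is found.
    static String firstDl(String html) {
        Matcher m = DL.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "<div class = \"ads\">\n" +
                "\t<dl style = \"font-size:14px; color:blue;\">\n" +
                "\t\t<li>\n" +
                "\t\t\t<a href = \"http://ggicci.blog.163.com\" target = \"_blank\">Ggicci's Blog</a>\n" +
                "\t\t</li>\n" +
                "\t</dl>\n" +
                "</div>\n";
        // Prints the dl element, from "<dl style" through "</dl>".
        System.out.println(firstDl(source));
    }
}
```

The reluctant .*? stops at the first </dl>, so this sketch would cut a nested dl short; for anything beyond simple snippets an HTML parser is probably the safer choice.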
Upvotes: 0
Reputation: 11992
The syntax [.\\s] makes no sense because, as Daniel said, the . just means "a dot" in this context.
Why can't you replace your [.\\s] with a much simpler . ?
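A quick way to check how . behaves inside and outside a character class is to try a few inputs. This is only a sketch; the class name DotInCharacterClass and the helper fullMatch are invented for the example.

```java
import java.util.regex.Pattern;

class DotInCharacterClass {
    // Thin wrapper around Pattern.matches: true only if the whole input matches.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The Java literal "[.\\s]" is the regex [.\s]: a literal dot or one whitespace character.
        System.out.println(fullMatch("[.\\s]", "a"));   // false: 'a' is neither
        System.out.println(fullMatch("[.\\s]", "."));   // true: literal dot
        System.out.println(fullMatch("[.\\s]", "\n"));  // true: whitespace
        System.out.println(fullMatch("[.\\s]", "\\"));  // false: a backslash is not in the class
        System.out.println(fullMatch("[.\\s]", "s"));   // false: neither is the letter s
        // Outside a character class, '.' is the usual wildcard.
        System.out.println(fullMatch(".", "a"));        // true
        System.out.println(fullMatch(".", "\n"));       // false: no DOTALL, so no line separators
    }
}
```

The last line is worth noting: a bare . will not cross line breaks unless DOTALL is enabled.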
Upvotes: 0
Reputation: 8560
+1 for above.
I would do:
<dl[^>]*>(.*?)</dl>
to match the content of the dl element.
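One caveat when applying this to the multi-line string from the question: without DOTALL the . cannot cross the line breaks inside the dl element, so the pattern finds nothing. Below is a rough sketch that passes Pattern.DOTALL as a flag and reads the content from capture group 1. The class name DlContent and both helper methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DlContent {
    static final String REGEX = "<dl[^>]*>(.*?)</dl>";

    // Hypothetical helper: returns what is between <dl ...> and </dl>, or null if there is no match.
    static String content(String html) {
        Matcher m = Pattern.compile(REGEX, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Same regex without DOTALL, to show the difference on multi-line input.
    static boolean findsWithoutDotall(String html) {
        return Pattern.compile(REGEX).matcher(html).find();
    }

    public static void main(String[] args) {
        String source = "<div class = \"ads\">\n" +
                "\t<dl style = \"font-size:14px; color:blue;\">\n" +
                "\t\t<li>\n" +
                "\t\t\t<a href = \"http://ggicci.blog.163.com\" target = \"_blank\">Ggicci's Blog</a>\n" +
                "\t\t</li>\n" +
                "\t</dl>\n" +
                "</div>\n";
        System.out.println(findsWithoutDotall(source)); // false
        System.out.println(content(source));            // the li element plus surrounding whitespace
    }
}
```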
Upvotes: 0
Reputation: 174309
Inside square brackets, characters are treated literally. [.\\s] means "match a dot, a backslash, or an s".
(.|\\s) is equivalent to . on its own.
I think you really want the following regex:
<dl[^>]*>.*?</div>
Upvotes: 3