Reputation: 3466
I would like to extract HTML content using regular expressions. The problem is that no matches are found when the content spans multiple lines. Here is the regular expression I use:
String regExpContent = "<div class=\"views-field views-field-body\">(\\s+)<span class=\"field-content\">([\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789( )(\r?\n)]+)</span>(\\s+)</div>";
Pattern regExpMatcherContent = Pattern.compile(regExpContent,
Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS);
I use (\r?\n) to match a newline. Can anybody help me?
Upvotes: 0
Views: 172
Reputation: 32797
The problem is that you are using a regex to parse HTML. You should use an HTML parser.
To answer your question:
Your Pattern.DOTALL flag is redundant because you are not using an unescaped . anywhere in your regex.
\s in your regex already matches newlines, because by default it is equivalent to [ \t\n\x0B\f\r].
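The two points above can be checked with a small sketch. The class and method names here are hypothetical and only serve the illustration; the calls themselves are plain java.util.regex.

```java
import java.util.regex.Pattern;

public class WhitespaceDemo {
    // \s on its own matches CR, LF, tab, and space; no flag is needed for that.
    static boolean whitespaceMatchesNewlines() {
        return Pattern.matches("\\s+", " \r\n\t");
    }

    // Without DOTALL, '.' does not match a line terminator.
    static boolean dotMatchesNewlineByDefault() {
        return Pattern.matches("a.b", "a\nb");
    }

    // With DOTALL, '.' matches line terminators as well.
    static boolean dotMatchesNewlineWithDotall() {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        System.out.println(whitespaceMatchesNewlines());   // true
        System.out.println(dotMatchesNewlineByDefault());  // false
        System.out.println(dotMatchesNewlineWithDotall()); // true
    }
}
```

So DOTALL only changes the behaviour of the dot; it has no effect on a pattern that relies on \s or on a character class.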
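The problem is with your character class [\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789( )(\r?\n)]+
Inside [...], the characters ( ) and ? are literals, not grouping or quantifier syntax, the digits are already covered by \w, and most of the escapes are unnecessary. It should be ([:,\\w\\s.„”()-]| )+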
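As a rough sketch of how a simplified character class behaves on multi-line input, the following applies it to a made-up sample. The class name, the method name, and the sample HTML are assumptions for illustration, and the alternation after the class is left out. The escapes \u201E and \u201D stand for the „ and ” quotation marks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldContentRegexDemo {
    // Simplified character class: no escapes are needed for : , . ( ) inside [...],
    // and the hyphen is placed last so it is read as a literal.
    private static final Pattern FIELD_CONTENT = Pattern.compile(
        "<div class=\"views-field views-field-body\">\\s+"
        + "<span class=\"field-content\">([:,\\w\\s.\u201E\u201D()-]+)</span>\\s+"
        + "</div>");

    // Returns the captured field text, or null when the pattern does not match.
    static String extract(String html) {
        Matcher m = FIELD_CONTENT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with the content spread over two lines.
        String html = "<div class=\"views-field views-field-body\">\n"
            + "  <span class=\"field-content\">First line,\nsecond line (2013).</span>\n"
            + "</div>";
        System.out.println(extract(html));
    }
}
```

This only works as long as the field contains nothing but the listed characters. Any nested tag or other punctuation makes the match fail, which is one reason a parser is the more robust choice.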
Upvotes: 0
Reputation: 338208
Please use an HTML parser.
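import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
String html = "<div class=\"views-field views-field-body\">...</div>";
Document doc = Jsoup.parseBodyFragment(html); // parse the fragment into a full document
Element body = doc.body();
Elements fieldContent = body.select("div.views-field-body span.field-content"); // matching spans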
The use of regex for parsing HTML has been discouraged so often that I won't repeat any of the arguments here. Suffice it to say that you really should not do it.
Upvotes: 1