vikifor

Reputation: 3466

How do I match a newline with a regular expression in Java?

I would like to extract HTML content using regular expressions. I run into problems when the content spans multiple lines: no matches are found. Here is the regular expression that I use:

String regExpContent = "<div class=\"views-field views-field-body\">(\\s+)<span class=\"field-content\">([\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789(&nbsp;)(\r?\n)]+)</span>(\\s+)</div>";
Pattern regExpMatcherContent = Pattern.compile(regExpContent,
            Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS);

I use (\r?\n) to match a newline. Can anybody help me?

Upvotes: 0

Views: 172

Answers (2)

Anirudha

Reputation: 32797

The problem is that you are using regex to parse HTML. You should use an HTML parser.


To answer your question:

Your Pattern.DOTALL flag is redundant because you are not using . anywhere in your regex. The flag only changes what . matches.

\s in your regex already matches newlines: by default it is equivalent to [ \t\n\x0B\f\r] in Java.
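The two points above can be checked with a small sketch. The class name WhitespaceDemo and the helper matches are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Pattern;

class WhitespaceDemo {
    // Hypothetical helper: compiles the regex with the given flags and
    // reports whether it matches the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \s covers CR, LF and tab, so no explicit (\r?\n) is needed.
        System.out.println(matches("a\\s+b", 0, "a\r\n\tb"));          // true
        // Without DOTALL, . does not match a line terminator.
        System.out.println(matches("a.b", 0, "a\nb"));                 // false
        // DOTALL only changes the behaviour of .
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));    // true
    }
}
```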

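The problem is with your character class [\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789(&nbsp;)(\r?\n)]+. Inside [...] parentheses are literal characters and do not group, so (&nbsp;) and (\r?\n) are not matched as sequences; they only add their individual characters to the set. The digits are also unnecessary because \w already includes them. It should be ([:,\\w\\s.„”()-]|&nbsp;)+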
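A minimal sketch of the corrected pattern in context, assuming the content contains only the characters listed in the class plus &nbsp; entities. The class name FieldContentRegex, the method extract and the sample HTML are made up for illustration. The quotes „ and ” are written as \u201E and \u201D so the source compiles regardless of file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldContentRegex {
    // "&nbsp;" is an alternative next to the character class, not part of it.
    // DOTALL is dropped because the pattern contains no dot.
    static final Pattern CONTENT = Pattern.compile(
            "<div class=\"views-field views-field-body\">\\s+"
            + "<span class=\"field-content\">"
            + "((?:[:,\\w\\s.\u201E\u201D()-]|&nbsp;)+)"
            + "</span>\\s+</div>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the text between the span tags, or null if nothing matches.
    static String extract(String html) {
        Matcher m = CONTENT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input spanning several lines.
        String html = "<div class=\"views-field views-field-body\">\n"
                + "  <span class=\"field-content\">First line,\n"
                + "second&nbsp;line (2013).</span>\n"
                + "</div>";
        System.out.println(extract(html));
    }
}
```

Any character outside the class, such as a nested tag or another entity, makes the match fail, which is one reason a parser is the more robust choice.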

Upvotes: 0

Tomalak

Reputation: 338208

Please use an HTML parser.

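import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<div class=\"views-field views-field-body\">...</div>";
Document doc = Jsoup.parseBodyFragment(html);  // parse the fragment into a document
Element body = doc.body();

Elements fieldContent = body.select("div.views-field-body span.field-content");  // CSS selector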

The use of regex for parsing HTML has been discouraged so often that I won't repeat any of the arguments here. Suffice it to say that you really should not do it.

Upvotes: 1
