Reputation: 135
I have code:
private static final Pattern TAG_REGEX = Pattern.compile("<p>(.+?)</p>");
private static List<String> getTagValues(final String str) {
final List<String> tagValues = new ArrayList<String>();
final Matcher matcher = TAG_REGEX.matcher(str);
while (matcher.find()) {
tagValues.add(matcher.group(1));
}
return tagValues;
}
System.out.println(Arrays.toString(getTagValues(stringText).toArray()));
and I want to extract text from this:
"<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>"
I want only the text between <p> and </p>.
The result should be only this:
"Aa aa Aa aa aa Aa aa aa aa Aa aa B b b"
But I don't know what to write in Pattern.compile("").
Can anyone help?
Upvotes: 1
Views: 215
Reputation: 3106
You don't need Pattern or Matcher for that; you could chain String.replaceAll calls instead:
str.replaceAll(".*?(<p>.*</p>).*", " $1 ").replaceAll(".*?<p>(.*?)</p>.*?", " $1 ").replaceAll("<[/a-z]+>", " ").replaceAll("[,.]", " ").replaceAll(" +", " ")
It doesn't look pretty but it gets the job done :)
Upvotes: 0
Reputation: 37404
I recommend using the Jsoup parser to extract your data from HTML code.
1.) Parse your data as a Document using the Jsoup.parse(string) function.
2.) Get the body tag as an Element.
3.) Fetch the text of the Element using element.text().
4.) Optionally, use replaceAll("\\s*[,.]\\s*", " ") to remove all commas and dots and normalize the spaces.
String stringText = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
// Requires org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
Document document = Jsoup.parse(stringText);
Element element = document.body();
String plainString = element.text().replaceAll("\\s*[,.]\\s*", " ");
System.out.println(element.text()); // Actual text
System.out.println(plainString); // Formatted text
Output :
Aa , aa. Aa aa, aa. Aa aa aa, aa. Aa, aa. B, b, b.Aa aa, aa.
Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa
Download Jsoup and add it as a dependency
Explanation of \\s*[,.]\\s*:
\\s* : match zero or more whitespace characters
[,.] : match any one character listed inside [], here a comma or a dot
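To see what this pattern does in isolation, here is a minimal sketch using only the standard library. The class and method names (PunctuationDemo, stripPunctuation) are made up for illustration and are not part of the answer above.

```java
public class PunctuationDemo {
    // Replaces each comma or dot, together with any whitespace around it,
    // with a single space. A dot at the very end therefore leaves a
    // trailing space, which the caller can remove with trim().
    static String stripPunctuation(String text) {
        return text.replaceAll("\\s*[,.]\\s*", " ");
    }

    public static void main(String[] args) {
        String cleaned = stripPunctuation("Aa , aa. Aa aa, aa.").trim();
        System.out.println(cleaned);
    }
}
```

Note that the whitespace around each comma or dot is consumed as part of the match, so " , " collapses to one space rather than three.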
If you insist on a regex solution, then:
1.) First replace the unwanted characters (, and .) together with the surrounding spaces using replaceAll("\\s*[.,]\\s*", " ").
2.) Use the regex <p[<>ib]*>([\\w\\s]+)<\\/[\\w]> with Pattern and Matcher to find the text between the tags.
3.) Append each match to a StringBuilder and display the result.
Code
String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
Pattern pattern = Pattern.compile("<p[<>ib]*>([\\w\\s]+)<\\/[\\w]>");
Matcher matcher = pattern.matcher(str.replaceAll("\\s*[.,]\\s*", " "));
StringBuilder builder = new StringBuilder();
while (matcher.find()) {
builder.append(matcher.group(1));
}
System.out.println(builder);
Output :
Aa aa Aa aa aa Aa aa aa aa Aa aa B b b
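A possible variant, sketched here as an assumption rather than as part of the answer above, keeps the asker's original <p>(.*?)</p> idea and cleans up each match afterwards, so the pattern does not have to list the allowed inner tags. The names ParagraphText and paragraphText are hypothetical. It assumes plain <p> tags without attributes and no nested paragraphs; for anything more irregular an HTML parser is the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphText {
    // Non-greedy match of everything between <p> and </p>.
    // DOTALL lets '.' also match line breaks inside a paragraph.
    private static final Pattern P_TAG =
            Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);

    static String paragraphText(String html) {
        StringBuilder builder = new StringBuilder();
        Matcher matcher = P_TAG.matcher(html);
        while (matcher.find()) {
            // Separate paragraphs with a space so words do not run together.
            builder.append(matcher.group(1)).append(' ');
        }
        return builder.toString()
                .replaceAll("<[^>]+>", " ") // drop nested tags such as <b> or <i>
                .replaceAll("[,.]", " ")    // drop commas and dots
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p>"
                + "<p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p>"
                + "<b>Aa aa, aa.</b></body></html>";
        System.out.println(paragraphText(html));
    }
}
```

Because only the content of the <p> elements is collected, the trailing <b>Aa aa, aa.</b> that sits outside any paragraph is left out.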
Upvotes: 2
Reputation: 60046
You can try this:
String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
String start = ">", end = "<";
String regexString = Pattern.quote(start) + "(.*?)" + Pattern.quote(end);
Pattern pattern = Pattern.compile(regexString);
Matcher matcher = pattern.matcher(str.replaceAll("[.,]", ""));
while (matcher.find()) {
if (!matcher.group(1).replaceAll("\\s{2,}", " ").trim().equals("")) {
System.out.print(matcher.group(1).replaceAll("\\s{2,}", " ") + " ");
}
}
This gives you:
Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa
Upvotes: 0