JavaCoder

Reputation: 135

Regex regular-expression Java String

I have this code:

private static final Pattern TAG_REGEX = Pattern.compile("<p>(.+?)</p>");
private static List<String> getTagValues(final String str) {
    final List<String> tagValues = new ArrayList<String>();
    final Matcher matcher = TAG_REGEX.matcher(str);
    while (matcher.find()) {
        tagValues.add(matcher.group(1));
    }
    return tagValues;
}
System.out.println(Arrays.toString(getTagValues(stringText).toArray()));

and I want to extract text from this string:

"<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>" 

I want only the text between <p> and </p>.

I want to get only this:

"Aa aa Aa aa aa Aa aa aa aa Aa aa B b b" 

But I don't know what to write in Pattern.compile(""). Can anyone help?

Upvotes: 1

Views: 215

Answers (3)

artemisian

Reputation: 3106

You don't need Pattern or Matcher for that; you could chain String.replaceAll calls instead:

str.replaceAll(".*?(<p>.*</p>).*", " $1 ").replaceAll(".*?<p>(.*?)</p>.*?", " $1 ").replaceAll("<[/a-z]+>", " ").replaceAll("[,.]", " ").replaceAll(" +", " ")

It doesn't look pretty but it gets the job done :)

Upvotes: 0

Pavneet_Singh

Reputation: 37404

I recommend using the jsoup parser to extract your data from the HTML:

1.) Parse your string into a Document using Jsoup.parse(string).

2.) Get the body tag as an Element.

3.) Fetch the text of that Element using element.text().

4.) Optionally, use replaceAll("\\s*[,.]\\s*", " ") to replace commas and dots (and the whitespace around them) with a single space.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    String stringText = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
    Document document = Jsoup.parse(stringText);
    Element element = document.body();
    String plainString = element.text().replaceAll("\\s*[,.]\\s*", " ");
    System.out.println(element.text()); // Actual text
    System.out.println(plainString);    // Formatted text

Output :

Aa , aa. Aa aa, aa. Aa aa aa, aa. Aa, aa. B, b, b.Aa aa, aa.
Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa 

Download jsoup and add it as a dependency.

Regex details for \\s*[,.]\\s* :

\\s* : matches zero or more whitespace characters

[,.] : matches any one character listed inside [], here a comma or a dot
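To see what that replacement does on its own, here is a minimal sketch using only the standard library. The class name PunctuationDemo and the method stripPunctuation are made up for illustration; the input is a shortened piece of the text from the question.

```java
public class PunctuationDemo {
    // Hypothetical helper: replaces each comma or dot, together with any
    // whitespace around it, by a single space.
    static String stripPunctuation(String text) {
        return text.replaceAll("\\s*[,.]\\s*", " ");
    }

    public static void main(String[] args) {
        // Brackets make the trailing space visible.
        System.out.println("[" + stripPunctuation("Aa , aa. B, b, b.") + "]");
        // prints "[Aa aa B b b ]"
    }
}
```

Note that the final dot also becomes a space, so you may want to call trim() on the result.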


If you insist on a regex solution, then:

1.) First replace unwanted characters such as , and . (together with the whitespace around them) with a single space using replaceAll("\\s*[.,]\\s*", " ")

2.) Use the regex <p[<>ib]*>([\\w\\s]+)<\\/[\\w]> with Pattern and Matcher to find the text between the tags

3.) Append each match to a StringBuilder and display the result

Code

    String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
    Pattern pattern = Pattern.compile("<p[<>ib]*>([\\w\\s]+)<\\/[\\w]>");
    Matcher matcher = pattern.matcher(str.replaceAll("\\s*[.,]\\s*", " "));
    StringBuilder builder = new StringBuilder();
    while (matcher.find()) {
        builder.append(matcher.group(1));
    }
    System.out.println(builder);

Output :

Aa aa Aa aa aa Aa aa aa aa Aa aa B b b 
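As a variation, here is a rough sketch that keeps the original pattern from the question, <p>(.+?)</p>, and cleans each match afterwards instead of encoding the nested tags in the pattern. The class name ParagraphText and the method paragraphWords are hypothetical. It assumes plain <p> tags without attributes and no line breaks inside a paragraph, which holds for the sample string but not for HTML in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphText {
    // Same pattern as in the question: lazily capture everything between <p> and </p>.
    private static final Pattern P_TAG = Pattern.compile("<p>(.+?)</p>");

    // Hypothetical helper: returns the words of all <p> elements, space separated.
    static String paragraphWords(String html) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = P_TAG.matcher(html);
        while (matcher.find()) {
            String inner = matcher.group(1)
                    .replaceAll("<[^>]+>", " ") // drop nested tags such as <b> or <i>
                    .replaceAll("[,.]", " ")    // drop commas and dots
                    .replaceAll("\\s+", " ")    // collapse runs of whitespace
                    .trim();
            if (!inner.isEmpty()) {
                parts.add(inner);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p>"
                + "<p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p>"
                + "<b>Aa aa, aa.</b></body></html>";
        System.out.println(paragraphWords(html));
        // prints "Aa aa Aa aa aa Aa aa aa aa Aa aa B b b"
    }
}
```

The trailing <b>Aa aa, aa.</b> is skipped because it sits outside any <p> element. For real-world HTML a parser is still the safer choice.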

Upvotes: 2

Youcef LAIDANI

Reputation: 60046

You can try this:

String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
String start = ">", end = "<";
String regexString = Pattern.quote(start) + "(.*?)" + Pattern.quote(end);
Pattern pattern = Pattern.compile(regexString);
Matcher matcher = pattern.matcher(str.replaceAll("[.,]", ""));
while (matcher.find()) {
    if (!matcher.group(1).replaceAll("\\s{2,}", " ").trim().equals("")) {
        System.out.print(matcher.group(1).replaceAll("\\s{2,}", " ") + " ");
    }
}

This gives you:

Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa 

Upvotes: 0
