Extracting text in html using Java Regex

Question

I need to extract text from HTML tags. I have written some code, but the text is not being extracted. My code is below:

import java.util.regex.Matcher;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;
class getFontTagText{
String result = null;
public static void main(String args[]){
    try{
           getFontTagText text = new getFontTagText();
           BufferedReader r = new BufferedReader(new FileReader("target.html"));
           Pattern p = Pattern.compile("(//AZUZZU Full Service Provision)",Pattern.MULTILINE);
           String line;
           System.out.println("Came here");
           while((line = r.readLine()) != null){
           Matcher mat = p.matcher(line);

           while(mat.find()){
                System.out.println("Came here");
                String st = mat.group(1);
                System.out.format("'%s'\n", st);
            }
        }
    }catch (Exception e){
        System.out.println(e);
    }
}

}

and the HTML file is here:

     
     <FONT>
         ZUZZU Full Service Provision
     </FONT>
     <FONT>
         ü ö ä Ä Ü Ö ß
     </FONT>

mat.group(1) is being printed 'null' instead of text. Any help is much appreciated.

Eritrean · Accepted Answer

I would recommend using jsoup, a Java library for extracting and manipulating HTML data using CSS selectors and jQuery-like methods. In your case it could look something like this:

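import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void jsoup() throws IOException {
    // Backslashes are doubled: a bare "\u" is read as a Unicode escape and will not compile.
    File input = new File("C:\\users\\uzochi\\desktop\\html.html");
    Document doc = Jsoup.parse(input, "UTF-8");
    Elements es = doc.select("FONT"); // select every FONT element
    for (Element e : es) {
        System.out.println(e.text()); // text content with the markup removed
    }
}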

If you prefer to use a regex, just match the text between > and <, for example:

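public static void regex() {
    // DOTALL lets '.' match the line breaks between a tag and its text.
    Pattern pat = Pattern.compile("<FONT[^>]*>(.*?)</FONT>", Pattern.DOTALL);
    // Sample input; the tag names are assumed (FONT, as in the jsoup selector).
    String s = "<HTML>\n" +
            "<BODY>\n" +
            "     <FONT>\n" +
            "         ZUZZU Full Service Provision\n" +
            "     </FONT>\n" +
            "     <FONT>\n" +
            "         ü ö ä Ä Ü Ö ß\n" +
            "     </FONT>\n" +
            "</BODY>\n" +
            "</HTML>";
    Matcher m = pat.matcher(s);
    while (m.find()) {
        String found = m.group(1).trim(); // drop the surrounding whitespace
        System.out.println("Found : " + found);
    }
}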
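
One thing to watch with the regex route: the code in the question matches one line at a time, while in the sample file the text appears to sit on a different line from its tags, so a per-line match can never see an opening tag, the text, and the closing tag together. Below is a minimal sketch that matches against the whole document instead. It assumes the text is wrapped in FONT tags; the class name FontTextExtractor, the method extractFontText, and the sample string in main are hypothetical and only there for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
public class FontTextExtractor {

    // CASE_INSENSITIVE accepts <font> as well as <FONT>;
    // DOTALL lets '.' cross line breaks inside an element.
    private static final Pattern FONT_TEXT = Pattern.compile(
            "<FONT[^>]*>(.*?)</FONT>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of every FONT element found in html. */
    public static List<String> extractFontText(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = FONT_TEXT.matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Assumed sample input: the whole document as a single string.
        String html = "<FONT size=\"1\">\n"
                + "    ZUZZU Full Service Provision\n"
                + "</FONT>\n"
                + "<font>\n"
                + "    second entry\n"
                + "</font>";
        for (String text : extractFontText(html)) {
            System.out.println("Found : " + text);
        }
    }
}
```

To use it on a file, read the whole file into one string first (for example by appending each line plus "\n" to a StringBuilder) and pass that string to extractFontText. This sketch does not handle nested FONT tags or a ">" inside an attribute value, which is one more reason to prefer jsoup for real-world HTML.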
