Reputation: 3573
I need to extract the main content (excluding links, advertisements, etc.) from a news web page. I have read about this on the web and learned that to do it I need to parse the HTML page and then select content from the HTML tags. I have written code that takes an HTML file as input and extracts the text from the page using the HTMLEditorKit available in javax.swing.text.html.
import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import java.util.ArrayList;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.MutableAttributeSet;

public class HTMLUtils {

    private HTMLUtils() {}

    /** Parses the given HTML and returns each text node as a separate string. */
    public static List<String> extractText(Reader reader) throws IOException {
        final ArrayList<String> list = new ArrayList<String>();

        ParserDelegator parserDelegator = new ParserDelegator();
        ParserCallback parserCallback = new ParserCallback() {
            @Override
            public void handleText(final char[] data, final int pos) {
                // Collect the character data of every text node encountered.
                list.add(new String(data));
            }
            @Override
            public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
            @Override
            public void handleEndTag(Tag t, final int pos) { }
            @Override
            public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
            @Override
            public void handleComment(final char[] data, final int pos) { }
            @Override
            public void handleError(final java.lang.String errMsg, final int pos) { }
        };

        parserDelegator.parse(reader, parserCallback, true);
        return list;
    }

    public static void main(String[] args) throws Exception {
        FileReader reader = new FileReader("C:/Users/Mukul/Desktop/demo.html");
        List<String> lines = HTMLUtils.extractText(reader);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
But my problem is that I can't figure out how to select only the main content from a web page, such as the article on a news site.
Also, I want to know whether the way I'm doing the parsing is fine, or whether I should use an open source library like Jsoup, JTidy, etc. for the same thing.
Please help me and correct me where I'm going wrong.
Upvotes: 1
Views: 1092
Reputation: 1303
Well, you have two problems. One is getting the page contents (the syntactic part, I guess), for which I would use the following idiom (not that there's something terribly wrong with the code you posted, it's just a bit too verbose for my taste):
String text = new Scanner(new File("C:/Users/Mukul/Desktop/demo.html"), "UTF-8").useDelimiter("\\A").next();
(Note that a plain Windows path is not a valid URL; for a remote page you would pass new URL("http://example.com/page.html").openStream() to the Scanner instead.)
The other is interpreting the String you just read (the semantic part). I don't think there's a single right answer, but if it's one single page you want to parse every time, it should have a fixed layout. You will have to find some pattern that distinguishes the main content from advertisements, headers, links, etc., and then maybe you can extract it using regexes.
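Since you already mentioned Jsoup: a selector-based approach is usually more robust than regexes once you know the page's layout. Here is a minimal sketch, assuming (hypothetically) that the article body sits in a container like <div class="article-body"> -- you would have to adjust the selector to the actual markup of the site you're scraping:
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExtractExample {
    public static void main(String[] args) throws Exception {
        // Parse the saved page; Jsoup.connect(url).get() would fetch a live one instead.
        Document doc = Jsoup.parse(new File("C:/Users/Mukul/Desktop/demo.html"), "UTF-8");

        // Hypothetical selector: replace "div.article-body" with whatever
        // container the news site actually uses for its article text.
        Element article = doc.select("div.article-body").first();
        if (article != null) {
            // text() returns the element's text with all tags stripped.
            System.out.println(article.text());
        }
    }
}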
Check this: http://code.google.com/p/boilerpipe/
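Boilerpipe's ArticleExtractor is built for exactly this "main content only" problem. A minimal sketch based on its quick-start usage, assuming the boilerpipe jars are on the classpath and using an example URL:
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeExample {
    public static void main(String[] args) throws Exception {
        // Example URL; replace it with the news page you want to extract.
        URL url = new URL("http://example.com/some-article.html");

        // ArticleExtractor strips navigation, ads and other boilerplate,
        // returning just the article text.
        String articleText = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(articleText);
    }
}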
Upvotes: 0