Reputation: 3573
I need to extract the main content (excluding links, advertisements, etc.) from a news web page. I have read about this on the web and learned that to do it I need to parse the HTML page and then select content from the HTML tags. I have written code that takes an HTML file as input and extracts the text from the page using the HTMLEditorKit available in javax.swing.text.html.
import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import java.util.ArrayList;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.MutableAttributeSet;

public class HTMLUtils {

    private HTMLUtils() {}

    /** Parses the given HTML and returns each text node as a separate string. */
    public static List<String> extractText(Reader reader) throws IOException {
        final ArrayList<String> list = new ArrayList<String>();

        ParserDelegator parserDelegator = new ParserDelegator();
        ParserCallback parserCallback = new ParserCallback() {
            @Override
            public void handleText(final char[] data, final int pos) {
                // Collect the character data of every text node encountered.
                list.add(new String(data));
            }
            @Override
            public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
            @Override
            public void handleEndTag(Tag t, final int pos) { }
            @Override
            public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
            @Override
            public void handleComment(final char[] data, final int pos) { }
            @Override
            public void handleError(final java.lang.String errMsg, final int pos) { }
        };

        parserDelegator.parse(reader, parserCallback, true);
        return list;
    }

    public static void main(String[] args) throws Exception {
        FileReader reader = new FileReader("C:/Users/Mukul/Desktop/demo.html");
        List<String> lines = HTMLUtils.extractText(reader);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
But my problem is that I can't figure out how to select only the main content from a web page, such as the article on a news site.
Also, I want to know whether the way I'm doing the parsing is fine, or whether I should use an open source library like Jsoup, JTidy, etc. for the same thing.
Please help me and correct me where I'm going wrong.
Upvotes: 1
Views: 1092
Reputation: 1303
Well, you have two problems. One is getting the page contents (the syntactic part, I guess), for which I would use the following idiom (not that there's something terribly wrong with the code you posted, it's just a bit too verbose for my taste):
String text = new Scanner(new File("C:/Users/Mukul/Desktop/demo.html"), "UTF-8").useDelimiter("\\A").next();
(Note that a plain Windows path is not a valid URL; for a remote page you would pass new URL("http://example.com/page.html").openStream() to the Scanner instead.)
The other is interpreting the String you just read (the semantic part). I don't think there's a single right answer, but if it's one single page you want to parse every time, it should have a fixed layout. You will have to find some pattern that distinguishes the main content from advertisements, headers, links, etc., and then maybe you can extract it using regexes.
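Since you already mentioned Jsoup: a selector-based approach is usually more robust than regexes once you know the page's layout. Here is a minimal sketch, assuming (hypothetically) that the article body sits in a container like <div class="article-body"> -- you would have to adjust the selector to the actual markup of the site you're scraping:
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExtractExample {
    public static void main(String[] args) throws Exception {
        // Parse the saved page; Jsoup.connect(url).get() would fetch a live one instead.
        Document doc = Jsoup.parse(new File("C:/Users/Mukul/Desktop/demo.html"), "UTF-8");

        // Hypothetical selector: replace "div.article-body" with whatever
        // container the news site actually uses for its article text.
        Element article = doc.select("div.article-body").first();
        if (article != null) {
            // text() returns the element's text with all tags stripped.
            System.out.println(article.text());
        }
    }
}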
Check this: http://code.google.com/p/boilerpipe/
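Boilerpipe's ArticleExtractor is built for exactly this "main content only" problem. A minimal sketch based on its quick-start usage, assuming the boilerpipe jars are on the classpath and using an example URL:
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeExample {
    public static void main(String[] args) throws Exception {
        // Example URL; replace it with the news page you want to extract.
        URL url = new URL("http://example.com/some-article.html");

        // ArticleExtractor strips navigation, ads and other boilerplate,
        // returning just the article text.
        String articleText = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(articleText);
    }
}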
Upvotes: 0