I need to extract one tag (which has multiple children) from a page and then split the retrieved text at tags whose content is a run of asterisks (*). The asterisk tags should be removed, and the remaining parts stored in a String array.
I used http://htmlparser.sourceforge.net/ before and it worked fine for extracting text out of specific tags.
public class ToeGuideParser extends NodeVisitor{
private static final String TAG = "ToeGuideParser";
final String url = "http://p7510.teamovercome.net/?page_id=18";
private String Guide;
Context context;
int tag_number = 0;
public ToeGuideParser () throws ParserException{
this(null);
}
public ToeGuideParser(Context context) throws ParserException{
this.context = context;
long bfr = startStopWatch();
Parser parser = new Parser (url);
parser.visitAllNodesWith(this);
stopStopWatch(bfr);
}
public void visitTag (Tag tag){
String tagName = tag.getTagName();
String content = tag.toPlainTextString();
//Log.d(TAG, tagName);
if (tagName.equalsIgnoreCase("div")){
Attribute attr = tag.getAttributeEx("class");
if (attr!=null){
String value = attr.getValue();
if (value.equals("entry-content")){
//save
Guide = tag.toHtml(true);
int guide_start = tag.getStartingLineNumber();
int guide_end = tag.getEndingLineNumber();
Log.d(TAG, "Guide starts at "+guide_start+" and ends at "+guide_end);
//Log.d(TAG, Guide);
}
}
}
if (content.contains("*****")){
tag_number++;
int start = tag.getStartingLineNumber();
int end = tag.getEndingLineNumber();
Log.d(TAG, tag_number+" = Tag found at "+start+", ends at "+end);
}
}
private void split (String bfrSplit){
if (bfrSplit != null){
//Log.d(TAG, bfrSplit);
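// \\1 is the regex backreference to the tag name; a single backslash here would be a Java octal escape
Pattern pattern = Pattern.compile("<([A-Z][A-Z0-9]*).*>[*]+</\\1>", Pattern.CASE_INSENSITIVE);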
Matcher matcher = pattern.matcher(bfrSplit);
while (matcher.find()){
Log.d(TAG,"Start index: " + matcher.start());
Log.d(TAG," End index: " + matcher.end() + " ");
Log.d(TAG,matcher.group());
}
}
}
public void finishedParsing(){
//split(Guide);
Log.w(TAG, "#########");
Log.w(TAG, "finished");
}
public long startStopWatch(){
return System.currentTimeMillis();
}
public String stopStopWatch(long bfr){
long time = System.currentTimeMillis()-bfr;
String formatedTime = "Time Taken: "+time+" milli's" ;
Log.i(TAG, formatedTime);
return formatedTime;
}
}
Problems with this code:
Log output to illustrate:
D / ToeGuideParser ( 2146): 1 = Tag found at 11, ends at 11
D / ToeGuideParser ( 2146): 2 = Tag found at 201, ends at 201
D / ToeGuideParser ( 2146): 3 = Tag found at 202, ends at 202
D / ToeGuideParser ( 2146): 4 = Tag found at 237, ends at 237
D / ToeGuideParser ( 2146): 5 = Tag found at 238, ends at 238
D / ToeGuideParser ( 2146): 6 = Tag found at 239, ends at 239
D / ToeGuideParser ( 2146): Guide starts at 248 and ends at 248
D / ToeGuideParser ( 2146): 7 = Tag found at 248, ends at 248
D / ToeGuideParser ( 2146): 8 = Tag found at 261, ends at 261
D / ToeGuideParser ( 2146): 9 = Tag found at 261, ends at 261
D / ToeGuideParser ( 2146): 10 = Tag found at 280, ends at 280
D / ToeGuideParser ( 2146): 11 = Tag found at 280, ends at 280
D / ToeGuideParser ( 2146): 12 = Tag found at 307, ends at 307
D / ToeGuideParser ( 2146): 13 = Tag found at 318, ends at 318
D / ToeGuideParser ( 2146): 14 = Tag found at 322, ends at 322
D / ToeGuideParser ( 2146): 15 = Tag found at 328, ends at 328
D / ToeGuideParser ( 2146): 16 = Tag found at 350, ends at 350
D / ToeGuideParser ( 2146): 17 = Tag found at 367, ends at 367
D / ToeGuideParser ( 2146): 18 = Tag found at 376, ends at 376
W / ToeGuideParser ( 2146): #########
W / ToeGuideParser ( 2146): finished
I / ToeGuideParser ( 2146): Time Taken: 1021 milli's
For the line numbers problem: I'm assuming you're wondering why you're getting more lines that say "Tag found at" than you expected. The problem is that this line effectively loads most of the HTML page into one string:
String content = tag.toPlainTextString();
The toPlainTextString() method includes the text content of the tag's children, so every ancestor of the tag containing ***** will contain ***** too. I think you probably want to use getText() instead, which doesn't include the text of the children (see JavaDoc).
Note that calling toPlainTextString() like this is best avoided anyway because it is slow: it is called for each tag in the document, so you may be forcing the document to be read a hundred times over unnecessarily.
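To make the over-counting concrete, here is a minimal sketch using a toy tree. ToyNode, deepText() and countMatches() are made up for illustration and are not part of the HTML Parser API; deepText() only stands in for a call that returns the text of a whole subtree.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a parsed tag; NOT the HTML Parser API.
class ToyNode {
    final String name;
    final String ownText;
    final List<ToyNode> children = new ArrayList<>();

    ToyNode(String name, String ownText) {
        this.name = name;
        this.ownText = ownText;
    }

    ToyNode add(ToyNode child) {
        children.add(child);
        return this;
    }

    // Text of this node plus all descendants (the "whole subtree" behaviour).
    String deepText() {
        StringBuilder sb = new StringBuilder(ownText);
        for (ToyNode c : children) {
            sb.append(c.deepText());
        }
        return sb.toString();
    }

    // Counts the nodes in the subtree whose text contains the marker,
    // using either the subtree text (deep) or only the node's own text.
    static int countMatches(ToyNode node, String marker, boolean deep) {
        String text = deep ? node.deepText() : node.ownText;
        int count = text.contains(marker) ? 1 : 0;
        for (ToyNode c : node.children) {
            count += countMatches(c, marker, deep);
        }
        return count;
    }

    public static void main(String[] args) {
        ToyNode html = new ToyNode("html", "")
            .add(new ToyNode("body", "")
                .add(new ToyNode("div", "")
                    .add(new ToyNode("p", "*****"))));
        System.out.println(countMatches(html, "*****", true));  // ancestors match as well
        System.out.println(countMatches(html, "*****", false)); // only the <p> matches
    }
}
```

With the subtree text, one starred paragraph is reported once per ancestor, which is consistent with the extra "Tag found at" lines in the log.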
For the regex problem:
You're not actually calling the split() method, which contains the regex, at the moment. But if you were, I assume it would fail because it tries to match the start tag, text node and end tag together, while HTML Parser only gives you one node at a time, i.e.:
visitTag() gives you just the start tag
visitStringNode() gives you just the text node
visitEndTag() gives you just the end tag
So your regex would fail because it expects to get the start, text and end all at once. Also, I think you need to escape the asterisk matcher like this: [\*]+ (although inside a character class the asterisk is already literal, so [*]+ should behave the same).
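If the whole block is already available as one HTML string (as with the Guide field in the question), a regex split is possible without the visitor callbacks. The following is only a sketch under that assumption: StarSplitter and its sample input are made up, and only java.util.regex is used. Compared with the pattern in the question, `[^>]*` keeps the attribute match inside the start tag and `\\1` is written with a doubled backslash so that it reaches the regex engine as a backreference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; works on a plain HTML string, not on parser nodes.
class StarSplitter {
    // Matches an element whose only content is a run of asterisks, e.g. <p>*****</p>.
    private static final Pattern SEPARATOR = Pattern.compile(
        "<([A-Z][A-Z0-9]*)[^>]*>\\s*[*]+\\s*</\\1>", Pattern.CASE_INSENSITIVE);

    // Splits the HTML at every separator element and drops empty parts.
    static List<String> split(String html) {
        List<String> parts = new ArrayList<>();
        for (String part : SEPARATOR.split(html)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<p>Intro</p><p>*****</p><p>Step 1</p>"
            + "<P class=\"sep\">***</P><p>Step 2</p>";
        for (String part : split(html)) {
            System.out.println(part);
        }
    }
}
```

The parts still contain markup; stripping the remaining tags would be a separate step.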
If you need to match something across all three nodes, then you need to add some private variables to your class to record state, i.e. if visitTag() matches the tag you want, set a boolean saying that the current tag is valid. Then when visitStringNode() is called, it can check that boolean to decide whether to process the text or ignore it. Unset the boolean when you encounter the end tag.
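The state-tracking idea can be sketched roughly as follows. SectionCollector and its three methods are hypothetical stand-ins for the start-tag, text and end-tag callbacks and do not use the HTML Parser signatures; the sketch also assumes well-formed markup in which every start tag has an end tag. A depth counter is added next to the boolean so that nested tags do not switch the flag off too early.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event receiver; feed it start-tag, text and end-tag events in document order.
class SectionCollector {
    private boolean inside = false; // true while within the target <div>
    private int depth = 0;          // nesting level of tags opened inside the target
    private final List<String> sections = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    void startTag(String name, String cssClass) {
        if (inside) {
            depth++;
        } else if (name.equalsIgnoreCase("div") && "entry-content".equals(cssClass)) {
            inside = true;
            depth = 0;
        }
    }

    void text(String text) {
        if (!inside) {
            return; // ignore everything outside the target element
        }
        String t = text.trim();
        if (t.matches("[*]{2,}")) {
            flush(); // a run of asterisks closes the current section
        } else if (!t.isEmpty()) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(t);
        }
    }

    void endTag(String name) {
        if (!inside) {
            return;
        }
        if (depth == 0) {
            inside = false; // this end tag closes the target element itself
            flush();
        } else {
            depth--;
        }
    }

    private void flush() {
        if (current.length() > 0) {
            sections.add(current.toString());
            current.setLength(0);
        }
    }

    String[] sections() {
        return sections.toArray(new String[0]);
    }

    public static void main(String[] args) {
        SectionCollector c = new SectionCollector();
        c.startTag("div", "entry-content");
        c.startTag("p", null); c.text("Intro"); c.endTag("p");
        c.startTag("p", null); c.text("*****"); c.endTag("p");
        c.startTag("p", null); c.text("Step 1"); c.endTag("p");
        c.endTag("div");
        for (String s : c.sections()) {
            System.out.println(s);
        }
    }
}
```

In a real visitor, the three methods would be called from visitTag(), visitStringNode() and visitEndTag() respectively, with the tag name and class attribute read from the tag object.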