Reputation: 1
This is my problem: I need to extract the text inside the "p" tags, without the XML markup, using a SAX parser.
<title>1. Introduction</title>
<p>The Lorem ipsum
<xref ref-type="bibr" rid="B1">
1
</xref>.
Lorem ipsum 23.
</p>
<p>The L domain recruits an ATP-requiring cellular factor for this
scission event, the only known energy-dependent step in assembly
<xref ref-type="bibr" rid="B2">
2
</xref>.
Domain is used here to denote the amino
acid sequence that constitutes the biological function.
</p>
Is it possible using endElement()? When I use it, I obtain only the part after the "/xref" tag.
Here is the code:
public void endElement(String s, String s1, String element) throws SAXException {
    if (element.equals(Finals.PARAGRAPH)) {
        Paragraph paragraph = new Paragraph();
        paragraph.setContext(tmpValue);
        System.out.println("Contesto: " + tmpValue);
        listP.add(paragraph);
    }
}

@Override
public void characters(char[] ac, int i, int j) throws SAXException {
    tmpValue = new String(ac, i, j);
}
This is what I expect to get: a list listP containing the two paragraphs:
1) Lorem ipsum 1 Lorem ipsum 23.
2) The L domain recruits an ATP-requiring cellular factor for this
scission event, the only known energy-dependent step in assembly 2
Domain is used here to denote the amino
acid sequence that constitutes the biological function.
Upvotes: 0
Views: 1924
Reputation: 91
Use a stack.
Push in startElement events and Pop in endElement events.
Or, if that doesn't work for you, just Push into the stack for all events and then, after endDocument, Pop the elements one by one. Store the data from </p> to <p> in reverse.
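One possible reading of the push/pop suggestion, as a minimal sketch: each open element gets its own text buffer on a stack, and when an element closes its text is either recorded (for a paragraph) or folded back into its parent. The class name StackParagraphHandler and the helper extract are hypothetical, and the sketch assumes the default, non-namespace-aware parser, so the tag name arrives in qName.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one StringBuilder per currently open element.
class StackParagraphHandler extends DefaultHandler {
    final List<String> paragraphs = new ArrayList<>();
    private final Deque<StringBuilder> stack = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Push a fresh buffer for every element that opens.
        stack.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text always belongs to the innermost open element.
        if (!stack.isEmpty()) {
            stack.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        StringBuilder text = stack.pop();
        if (qName.equals("p")) {
            // A paragraph is complete: keep its text, nested elements included.
            paragraphs.add(text.toString().trim());
        } else if (!stack.isEmpty()) {
            // Fold the text of a child such as <xref> back into its parent.
            stack.peek().append(text);
        }
    }

    // Hypothetical convenience method: parse an XML string and return the paragraphs.
    static List<String> extract(String xml) throws Exception {
        StackParagraphHandler handler = new StackParagraphHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paragraphs;
    }
}
```

Because child text is appended to the parent when the child closes, the text of "xref" ends up in the right position inside the paragraph, while the title is never recorded since none of its ancestors is a "p".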
Upvotes: 0
Reputation: 35891
There are many possible solutions. With SAX parsers you usually add boolean flags to track particular states while parsing. In this simple example, it is enough to change this:
tmpValue = new String(ac, i, j);
to this:
if (tmpValue.equals(""))
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);
or:
if (tmpValue == null)
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);
depending on how you initialize the tmpValue variable (and you should initialize it if you're not doing it already).
To gather the contents of all paragraphs, reset tmpValue after storing each one:
public void endElement(String s, String s1, String element) throws SAXException {
    if (element.equals(Finals.PARAGRAPH)) {
        Paragraph paragraph = new Paragraph();
        paragraph.setContext(tmpValue);
        System.out.println("Contesto: " + tmpValue);
        listP.add(paragraph);
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}
and to omit the title part:
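@Override
public void startElement(
        String uri,
        String localName,
        String qName,
        Attributes atts) {
    // Compare qName, as endElement does above: localName is empty
    // unless the parser is namespace-aware.
    if (qName.equals(Finals.PARAGRAPH)) {
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}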
Upvotes: 0
Reputation: 8058
I'm not sure what you mean by "is it possible using endElement", but it's certainly possible. You'd need to write your SAX application so it:
(1) ignores all startElement/endElement events between the ones for the <p>aragraph. This takes simple state tracking, or perhaps you can simply say that you aren't interested in elements other than paragraphs and make your element event handlers be no-ops for anything you don't care about.
(2) accumulates separately-delivered characters() events until the endElement for the <p>aragraph. You need to do this anyway, because SAX always reserves the right to deliver contiguous text as several characters() calls, for reasons having to do with parser buffer management.
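The two points above can be sketched as a small handler: a boolean tracks whether the parser is inside a paragraph, and a StringBuilder accumulates every characters() call until the paragraph closes. The names ParagraphExtractor, insideParagraph and extract are hypothetical, the tag name "p" is taken from the question, and the whitespace collapsing at the end is an added assumption about the desired output. The sketch also assumes a non-namespace-aware parser, so the tag name is read from qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the plain text of every <p> element.
class ParagraphExtractor extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean insideParagraph = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("p")) {
            insideParagraph = true;
            buffer.setLength(0); // start a fresh paragraph
        }
        // Any other element (for example <xref>) is deliberately a no-op.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append rather than assign: one text run may arrive in several calls.
        if (insideParagraph) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("p")) {
            // Assumption: runs of whitespace and line breaks collapse to one space.
            paragraphs.add(buffer.toString().trim().replaceAll("\\s+", " "));
            insideParagraph = false;
        }
    }

    // Hypothetical convenience method: parse an XML string and return the paragraphs.
    static List<String> extract(String xml) throws Exception {
        ParagraphExtractor handler = new ParagraphExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paragraphs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<sec><title>1. Introduction</title>"
                + "<p>The Lorem ipsum <xref ref-type=\"bibr\" rid=\"B1\">1</xref>. Lorem ipsum 23.</p></sec>";
        for (String paragraph : extract(xml)) {
            System.out.println(paragraph);
        }
    }
}
```

Title text is skipped because the flag is only set between the start and end of a paragraph, and the text of nested elements is kept because their start and end events are ignored while their characters() events still reach the buffer.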
Upvotes: 2