Reputation: 436
I have a document that contains sections such as Assessments, HPI, ROS, Vitals, etc., and I want to extract the notes in each section. I am using GATE for this. I have written a JAPE file that extracts the notes in the Assessments section. Here is the grammar:
Input: Token
Options: control=appelt debug=true
Rule: Assess
({Token.string =~"(?i)diagnose[d]?"}{Token.string=="with"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)from"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)with"})
(
({Token})*
):assessments
({Token.string =~"(?i)HPI"} | {Token.string =~"(?i)ROS"} | {Token.string =~"(?i)EXAM"} | {Token.string =~"(?i)VITAL[S]"} | {Token.string =~"(?i)TREATMENT[s]"} |{Token.string=~"(?i)use[d]?"}{Token.string=~"(?i)orderset[s]?"} | {Token.string=~"$"})
-->
:assessments.Assessments = {}
When the Assessments section is at the end of the document, the notes are retrieved properly. But if it sits between two other sections, the rule returns everything from the Assessments section to the end of the file.
I have tried using {Token.string=~"$"} in different ways, but could not extract only the Assessments section irrespective of its place in the document.
Please explain how I can achieve this using a JAPE grammar.
Upvotes: 2
Views: 1211
Reputation: 122364
That is the expected behaviour, since Appelt mode always prefers the longest possible overall match. Because any Token can match string =~ "$", the assessments label will grab all but the final token in the document.
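To see why every Token satisfies that constraint: JAPE's =~ operator has "find" semantics (the pattern may match anywhere in the string), whereas ==~ requires the whole string to match. The following is only a plain java.util.regex sketch of those two behaviours, not GATE code; the class and method names are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex, no GATE classes involved.
final class DollarFindDemo {
    private static final Pattern END = Pattern.compile("$");

    // Find-style check, analogous to JAPE's =~ operator.
    static boolean findMatches(String tokenString) {
        return END.matcher(tokenString).find();
    }

    // Whole-string check, analogous to JAPE's ==~ operator.
    static boolean wholeMatches(String tokenString) {
        return END.matcher(tokenString).matches();
    }

    public static void main(String[] args) {
        for (String s : List.of("HPI", "with", "diabetes")) {
            System.out.println(s + " -> find=" + findMatches(s)
                    + ", matches=" + wholeMatches(s));
        }
    }
}
```

Every non-empty string "finds" the empty match at its own end, so the constraint never distinguishes the last token from any other.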
I would adopt a two-pass approach: an initial gazetteer or JAPE phase to annotate the section headings, and then another phase with only these Heading annotations in its Input line:
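Imports: { import static gate.Utils.*; }
Phase: AnnotateBetweenHeadings
Input: Heading
Options: control = appelt
Rule: TwoHeadings
({Heading.type == "assessments"}):h1
(({Heading})?):h2
-->
{
  // default to the end of the document when no heading follows
  Long endOffset = end(doc);
  AnnotationSet h2Annots = bindings.get("h2");
  if(h2Annots != null && !h2Annots.isEmpty()) {
    endOffset = start(h2Annots);
  }
  try {
    outputAS.add(end(bindings.get("h1")), endOffset, "Assessments", featureMap());
  } catch(InvalidOffsetException e) {
    // add() declares this checked exception, so it must be handled here
    throw new GateRuntimeException(e);
  }
}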
This will annotate everything between the end of the assessments heading and the start of the following heading, or the end of the document if there is no following heading.
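The offset logic can also be tried outside GATE. Below is a minimal plain-Java sketch, assuming a hypothetical Heading class that stands in for the Heading annotations (type plus start and end offsets) and a list already sorted by start offset. Unlike the JAPE rule, which fires for every matching heading, it only handles the first heading of the requested type.

```java
import java.util.List;

// Illustration only: plain Java, no GATE classes involved.
final class SectionSpan {
    // Hypothetical stand-in for a Heading annotation.
    static final class Heading {
        final String type;
        final long start;
        final long end;

        Heading(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Returns {start, end} of the text after the first heading of the given
    // type, or null if there is no such heading. The span ends at the start
    // of the next heading, or at docLength when no heading follows.
    static long[] spanAfter(List<Heading> headings, String type, long docLength) {
        for (int i = 0; i < headings.size(); i++) {
            Heading h = headings.get(i);
            if (h.type.equals(type)) {
                long end = (i + 1 < headings.size())
                        ? headings.get(i + 1).start
                        : docLength;
                return new long[] {h.end, end};
            }
        }
        return null;
    }
}
```

With headings at offsets 0-3 (hpi), 40-51 (assessments) and 120-123 (ros), spanAfter(headings, "assessments", 200) yields the span 51 to 120; if the assessments heading were last, the span would run to 200.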
Upvotes: 1
Reputation: 23
Tyson Hamilton provides this alternative, which explicitly annotates the start and end of the document (SOD and EOD), since $ doesn't work in JAPE:
Rule: DOCMARKERS
// we need to match something even though we don't use it directly
(({Token})):doc
-->
:doc{
  FeatureMap features = Factory.newFeatureMap();
  features.put("rule", ruleName());
  try {
    outputAS.add(0L, 0L, "SOD", features);
    outputAS.add(docAnnots.getDocument().getContent().size(), docAnnots.getDocument().getContent().size(), "EOD", features);
  } catch (InvalidOffsetException ioe) {
    throw new GateRuntimeException(ioe);
  }
}
I found that EOD was only recognized in later rules by giving it some length. So I have this:
Rule: DOCMARKERS
Priority: 2
(
  ({Sentence}) // we need to match something even though we don't use it directly
):doc
-->
:doc{
  FeatureMap features = Factory.newFeatureMap();
  features.put("rule", "DOCMARKERS");
  try {
    outputAS.add(0L, 0L, "SOD", features);
    long docsize = docAnnots.getDocument().getContent().size();
    // The only way I could get EOD to be recognized in later rules was to
    // give it some length, hence the -2 and -1
    outputAS.add(docsize-2, docsize-1, "EOD", features);
    System.err.println("Debug: added EOD");
  } catch (InvalidOffsetException ioe) {
    throw new GateRuntimeException(ioe);
  }
}
And then, once EOD is added to the Input line of your phase, you should be able to change the end of your rule to
...| {EOD})
Upvotes: 0