Reputation: 387
I'm trying to rename the LI and TABLE annotations that come from the HTML conversion, like this:
Document{-> RETAINTYPE(MARKUP)};
LI{->MARK(List)};
Document{-> RETAINTYPE};
That works fine. But when I use the same script for the table, like this:
DECLARE TableContent;
Document{-> RETAINTYPE(MARKUP)};
TABLE{->MARK(TableContent)};
Document{-> RETAINTYPE};
the table is not tagged.
Input file:
<table class="IM-Core-Table TableOverride-1" id="t1" border="1">
<colgroup><col /></colgroup>
<colgroup><col /></colgroup>
<colgroup><col /></colgroup>
<colgroup><col /></colgroup><tbody>
<tr class="IM-Core-Table _idGenTableRowColumn-1">
<td valign="top" style=""><p class="MsoNormal"><a name="para201">ICD-10</a></p>
</td>
<td valign="top" style=""><p class="MsoNormal"><a name="para202">Males</a></p>
</td>
<td valign="top" style=""><p class="MsoNormal"><a name="para203">Females</a></p>
</td>
<td valign="top" style=""><p class="MsoNormal"><a name="para204">Total</a></p>
</td>
</tr>
<tr class="IM-Core-Table _idGenTableRowColumn-1">
Mood disorders (F30-F39)
2 10 12 Neurotic, stress-related and somatoform disorders (F40- F48) 0 5 5 Problems related to social environment (Z60) 0 2 2</tbody>
</table>
Upvotes: 0
Views: 73
Reputation: 3113
The problem is that the HTML contains spaces and line breaks. By default, the HtmlAnnotator creates an annotation for the content of an HTML element. This means that if there is a line break after the opening tag, the created annotation starts at the offset of that line break. Line breaks, like whitespace and markup, are not visible by default, and any annotation that starts with something invisible is itself invisible to the rules. In your input the content of the table element begins with a line break, and your script retains only MARKUP, so the line break stays invisible and TABLE is never matched. The simplest solution is to make these types visible temporarily and trim unwanted or invisible spans, such as whitespace and line breaks, from the beginning and end of the annotations.
Here's the script I used for testing this:
TYPESYSTEM utils.HtmlTypeSystem;
ENGINE utils.HtmlAnnotator;
EXEC(HtmlAnnotator, {TAG});
DECLARE TableContent;
RETAINTYPE(MARKUP, WS);
TABLE{-> TRIM(WS)};
TABLE{-> TableContent};
RETAINTYPE;
When I work with the HtmlAnnotator, I often do something like:
RETAINTYPE(MARKUP, WS);
TAG{-> TRIM(MARKUP, WS)};
RETAINTYPE;
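As a rough sketch, the two snippets above could be combined to cover both cases from the question. This has not been run against the posted input. The type names List and TableContent are taken from the question, and the explicit MARK actions are assumed to behave the same as the implicit form used in the test script. Note that trimming every TAG may shrink elements that contain only markup or whitespace, so check the results on your own documents.

```
TYPESYSTEM utils.HtmlTypeSystem;
ENGINE utils.HtmlAnnotator;
EXEC(HtmlAnnotator, {TAG});

DECLARE List, TableContent;

// Temporarily make markup and whitespace (including line breaks) visible,
// so rules can match annotations that begin or end on them.
RETAINTYPE(MARKUP, WS);

// Shrink every HTML element annotation so that it no longer starts or ends
// on markup or whitespace.
TAG{-> TRIM(MARKUP, WS)};

// Mark the elements from the question.
LI{-> MARK(List)};
TABLE{-> MARK(TableContent)};

// Restore the default filtering.
RETAINTYPE;
```

Trimming all TAG annotations first means the later rules do not depend on where the line breaks fall in a particular HTML file.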
DISCLAIMER: I am a developer of UIMA Ruta
Upvotes: 0