Reputation: 583
Given this file: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/Dart.pdf
I would like to get a result like this:
{
  "file": {
    "title": "Dart Programming Language Specification",
    "1 Scope": {
      "text": "This Ecma standard specifies the syntax and semantics of the Dart programming language. It does not specify the APIs of the Dart libraries except where those library elements are essential to the correct functioning of the language itself (e.g., the existence of class Object with methods such as noSuchMethod, runtimeType)."
    },
    "2 Conformance": {
      "text": "A conforming implementation of the Dart programming language must provide and support all the APIs (libraries, types, functions, getters, setters, whether top-level, static, instance or local) mandated in this specification. A conforming implementation is permitted to provide additional APIs, but not additional syntax, except for experimental features in support of null-aware cascades that are likely to be introduced in the next revision of this specification."
    },
    "3 Normative References": [
      {
        "text": "The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.",
        "1": "The Unicode Standard, Version 5.0, as amended by Unicode 5.1.0, or successor.",
        "2": "Dart API Reference, https://api.dartlang.org/"
      }
    ]
    ...
  }
}
My first idea was to perform layout detection with deep learning and OCR (notably Tesseract), using Detectron2 through libraries such as deepdoctection and layout-parser. But in my tests, these tools do not seem to capture how the text is organized: I can only extract the overall layout (titles, text boxes, and tables), and the detected boxes still have to be sorted by their coordinates afterwards.
My second idea is to first convert the PDF file into a text file using a text extraction approach that preserves the layout; several tools, such as PyMuPDF, can do this. I would then process the text file and build the outline with its parts (titles, subtitles, body text, etc.) as a dictionary, based on the whitespace in the file. But this solution does not seem robust, because some PDF files have no indentation on their section titles.
Is there a way to detect each part of the layout together with its text and its associated subparts?
Upvotes: 0
Views: 1645
Reputation: 11867
Once you have the text, you can convert it to JSON, e.g. with https://www.npmjs.com/package/text-2-json (the binary PDF needs to be converted to text first).
To maintain indentation, the strings of text need a replacement for the voids at each side of a line (RTL or LTR). One way is to use textual HTML as output rather than plain text. Certainly do NOT use OCR if you already have the PDF structure and the styles for the characters. Run this PDF2HTM output to see how well it emulates the PDF (a similar methodology to PDF.js text, but without the JS). It shows that body text is 10pt and headings are 14.3pt:
body {background-color:slategray}
div {position:relative;background-color:white;margin:1em auto;box-shadow:1px 1px 8px -2px black}
p {position:absolute;white-space:pre;margin:0}
<div id="page1" style="width:612.0pt;height:792.0pt">
<p style="top:91.9pt;left:133.8pt;line-height:10.0pt"><i><span style="font-family:LMRomanSlant10,serif;font-size:10.0pt">Dart Programming Language Specification</span></i></p>
<p style="top:91.9pt;left:472.5pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">6</span></p>
<p style="top:123.3pt;left:133.8pt;line-height:14.3pt"><b><span style="font-family:LMRoman12,serif;font-size:14.3pt">1</span></b></p>
<p style="top:123.3pt;left:158.0pt;line-height:14.3pt"><b><span style="font-family:LMRoman12,serif;font-size:14.3pt">Scope</span></b></p>
<p style="top:132.3pt;left:498.4pt;line-height:2.0pt"><span style="font-family:LMRoman5,serif;font-size:2.0pt;color:#ffffff">ecmaScope</span></p>
<p style="top:148.6pt;left:148.7pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">This Ecma standard specifies the syntax and semantics of the Dart program-</span></p>
<p style="top:160.6pt;left:133.8pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">ming language. It does not specify the APIs of the Dart libraries except where</span></p>
<p style="top:172.5pt;left:133.8pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">those library elements are essential to the correct functioning of the language</span></p>
<p style="top:184.5pt;left:133.8pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">itself (e.g., the existence of class</span><tt><span style="font-family:LMMono10,monospace;font-size:10.0pt"> Object</span></tt><span style="font-family:LMRoman10,serif;font-size:10.0pt"> with methods such as</span><tt><span style="font-family:LMMono10,monospace;font-size:10.0pt"> noSuchMethod</span></tt><span style="font-family:LMRoman10,serif;font-size:10.0pt">,</span></p>
<p style="top:196.4pt;left:133.8pt;line-height:10.0pt"><tt><span style="font-family:LMMono10,monospace;font-size:10.0pt">runtimeType</span></tt><span style="font-family:LMRoman10,serif;font-size:10.0pt">).</span></p>
<p style="top:225.7pt;left:133.8pt;line-height:14.3pt"><b><span style="font-family:LMRoman12,serif;font-size:14.3pt">2</span></b></p>
<p style="top:225.7pt;left:158.0pt;line-height:14.3pt"><b><span style="font-family:LMRoman12,serif;font-size:14.3pt">Conformance</span></b></p>
<p style="top:234.7pt;left:498.4pt;line-height:2.0pt"><span style="font-family:LMRoman5,serif;font-size:2.0pt;color:#ffffff">ecmaConformance</span></p>
<p style="top:251.0pt;left:148.7pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">A conforming implementation of the Dart programming language must pro-</span></p>
<p style="top:262.9pt;left:133.8pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">vide and support all the APIs (libraries, types, functions, getters, setters, whether</span></p>
<p style="top:274.9pt;left:133.8pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">top-level, static, instance or local) mandated in this specification.</span></p>
<p style="top:286.8pt;left:148.7pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">A conforming implementation is permitted to provide additional APIs, but</span></p>
<p style="top:298.8pt;left:133.8pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">not additional syntax, except for experimental features in support of null-aware</span></p>
<p style="top:310.8pt;left:133.8pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">cascades that are likely to be introduced in the next revision of this specification.</span></p>
<p style="top:340.0pt;left:133.8pt;line-height:14.3pt"><b><span style="font-family:LMRoman12,serif;font-size:14.3pt">3</span></b></p>
<p style="top:340.0pt;left:158.0pt;line-height:14.3pt"><b><span style="font-family:LMRoman12,serif;font-size:14.3pt">Normative References</span></b></p>
<p style="top:349.0pt;left:498.4pt;line-height:2.0pt"><span style="font-family:LMRoman5,serif;font-size:2.0pt;color:#ffffff">ecmaNormativeReferences</span></p>
<p style="top:365.3pt;left:148.7pt;line-height:10.0pt"><span style="font-family:LMRoman10,serif;font-size:10.0pt">The following referenced documents are indispensable for the application</span></p>
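The font sizes in the output above suggest one way to separate headings from body text. What follows is only a minimal Python sketch, assuming the spans are already available as `(font_size, text)` pairs in reading order; with PyMuPDF such pairs could be collected from the `size` and `text` fields of the spans returned by `page.get_text("dict")`. The names `build_outline`, `join_line`, `SPANS` and the `min_size` threshold are hypothetical, and the rule "larger than the dominant size means heading" is an assumption that happens to hold for this file.

```python
from collections import Counter


def join_line(prev, line):
    """Append a line to a paragraph, undoing end-of-line hyphenation."""
    if not prev:
        return line
    if prev.endswith("-"):
        # Assumption: a trailing hyphen is a line-break hyphen, not a real one.
        return prev[:-1] + line
    return prev + " " + line


def build_outline(spans, min_size=5.0):
    """Group (font_size, text) spans into {heading: {"text": body}}."""
    # Drop empty spans and very small text (hypothetical threshold).
    kept = [(round(s, 1), t.strip()) for s, t in spans
            if s >= min_size and t.strip()]
    if not kept:
        return {}

    # Assume the body size is the one that carries the most characters.
    weight = Counter()
    for size, text in kept:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]

    outline, heading_parts, key = {}, [], None
    for size, text in kept:
        if size > body_size:
            # Consecutive large spans ("1", "Scope") form one heading.
            heading_parts.append(text)
            continue
        if heading_parts:
            key = " ".join(heading_parts)
            outline[key] = {"text": ""}
            heading_parts = []
        if key is None:
            continue  # text before the first heading, e.g. a running header
        outline[key]["text"] = join_line(outline[key]["text"], text)
    if heading_parts:  # a heading with no body text after it
        outline[" ".join(heading_parts)] = {"text": ""}
    return outline


# Sample spans copied from the first lines of the HTML output above.
SPANS = [
    (10.0, "Dart Programming Language Specification"),
    (10.0, "6"),
    (14.3, "1"),
    (14.3, "Scope"),
    (2.0, "ecmaScope"),
    (10.0, "This Ecma standard specifies the syntax and semantics of the Dart program-"),
    (10.0, "ming language. It does not specify the APIs of the Dart libraries except where"),
    (14.3, "2"),
    (14.3, "Conformance"),
    (2.0, "ecmaConformance"),
    (10.0, "A conforming implementation of the Dart programming language must pro-"),
    (10.0, "vide and support all the APIs (libraries, types, functions, getters, setters, whether"),
]

outline = build_outline(SPANS)
```

Running headers on later pages would end up inside the body text with this sketch, so a real pipeline would probably also need to filter by position on the page.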
NOTE: There is some 2.0pt white text on the right-hand side (tagging?):
1 Scope
ecmaScope
This Ecma standard....
This text will be white in the indented HTML, and it can add to the complexity of non-indented xml/stext/txt extraction (move the text view below to see it as text). So the best source of indenting as above is
mutool convert -pretty -o dart.html dart.pdf
However, that will give single lines, just like in the PDF. The second-best alternative is
pdftotext.exe -layout dart.pdf
but you then need to parse the result as text:
Dart Programming Language Specification 6
1 Scope ecmaScope
This Ecma standard specifies the syntax and semantics of the Dart program-
ming language. It does not specify the APIs of the Dart libraries except where
those library elements are essential to the correct functioning of the language
itself (e.g., the existence of class Object with methods such as noSuchMethod,
runtimeType).
2 Conformance ecmaConformance
A conforming implementation of the Dart programming language must pro-
vide and support all the APIs (libraries, types, functions, getters, setters, whether
top-level, static, instance or local) mandated in this specification.
A conforming implementation is permitted to provide additional APIs, but
not additional syntax, except for experimental features in support of null-aware
cascades that are likely to be introduced in the next revision of this specification.
3 Normative References ecmaNormativeReferences
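Parsing that layout text could look roughly like the Python sketch below. It assumes the output of `pdftotext -layout` has been read into a string, and it uses the trailing `ecma...` tag as the marker for a heading line, which is specific to this file. The names `HEADING`, `parse_layout_text` and `SAMPLE` are hypothetical.

```python
import re

# A heading line: section number, title, then the hidden "ecma..." tag.
# Assumption: only heading lines end with such a tag in this document.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+?)\s+ecma\w+$")


def parse_layout_text(text):
    """Split layout-preserved text into {"<number> <title>": {"text": body}}."""
    sections = {}
    key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            key = f"{match.group(1)} {match.group(2)}"
            sections[key] = {"text": ""}
        elif key is not None:  # lines before the first heading are skipped
            prev = sections[key]["text"]
            if prev.endswith("-"):
                # Assumption: a trailing hyphen marks a word split by a line break.
                sections[key]["text"] = prev[:-1] + line
            elif prev:
                sections[key]["text"] = prev + " " + line
            else:
                sections[key]["text"] = line
    return sections


SAMPLE = """\
Dart Programming Language Specification                      6
1    Scope                                               ecmaScope
   This Ecma standard specifies the syntax and semantics of the Dart program-
ming language. It does not specify the APIs of the Dart libraries except where
those library elements are essential to the correct functioning of the language
itself (e.g., the existence of class Object with methods such as noSuchMethod,
runtimeType).
2    Conformance                                   ecmaConformance
   A conforming implementation of the Dart programming language must pro-
vide and support all the APIs (libraries, types, functions, getters, setters, whether
top-level, static, instance or local) mandated in this specification.
3    Normative References                  ecmaNormativeReferences
"""

sections = parse_layout_text(SAMPLE)
```

For PDFs without such tags, the heading test would have to rely on something else, such as the numbering pattern alone, which is more prone to false matches.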
Upvotes: 0