Reputation: 1035
I am trying to parse a PDF and categorize its content based on text formatting/decoration. How would you suggest I do that?
For example, I have a PDF in which the following structure is repeated:
S.No. BOLD+UNDERLINED TITLE para
How do I collect this data into an array of objects based on text decoration:
[
{ sno: "", title: "", desc: "" },
...
]
Upvotes: 1
Views: 2471
Reputation: 1035
I went through the documentation for pdf2json and figured that, after parsing the PDF, I might have to use the pdfData.formImage.Pages[pageNumber].Texts[wordNumber].R[0] object to get hold of the values I need.
The TS property of that object is an array; the value at TS[2] indicates whether the text is bold (value = 1) or not (value = 0). I could not find any details on data related to underline text decoration.
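To make the idea concrete, here is a minimal sketch of how the bold flag could be used to build the { sno, title, desc } objects from the question. It is only an illustration, not something taken from the pdf2json docs: the function name groupByBold, the helper join, and the SERIAL pattern are hypothetical, and it assumes that each serial number arrives as its own text run (like "1." or "2)"), that the title is the bold text right after it, and that everything else up to the next serial number is the paragraph. It also assumes the T field of each run is URI-encoded, so it is decoded first.

```javascript
// Sketch only. Assumes each Texts entry has the shape { R: [{ T, TS }] },
// where T is URI-encoded text and TS[2] === 1 marks bold text.
const SERIAL = /^\d+[.)]?$/; // hypothetical serial-number pattern, e.g. "1." or "2)"

// Hypothetical helper: append a piece of text with a single space in between.
const join = (existing, piece) => (existing ? existing + " " + piece : piece);

function groupByBold(texts) {
  const records = [];
  let current = null;

  for (const item of texts) {
    const run = item.R && item.R[0];
    if (!run) continue;

    const text = decodeURIComponent(run.T).trim();
    if (!text) continue;
    const isBold = Array.isArray(run.TS) && run.TS[2] === 1;

    // A serial number starts a new record, unless the current record
    // is still waiting for its title.
    if (SERIAL.test(text) && (current === null || current.title !== "")) {
      current = { sno: text, title: "", desc: "" };
      records.push(current);
    } else if (current === null) {
      continue; // ignore anything before the first serial number
    } else if (isBold && current.desc === "") {
      current.title = join(current.title, text); // bold run(s) right after the serial number
    } else {
      current.desc = join(current.desc, text); // everything else is paragraph text
    }
  }
  return records;
}
```

Inside the parser's data-ready handler this would be called with one page's Texts array, for example groupByBold(pdfData.formImage.Pages[0].Texts). Since underline does not seem to be exposed, the sketch relies on the bold flag alone; a paragraph that contains a standalone number run would be split incorrectly, so the SERIAL pattern probably needs adjusting for a real document.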
I also needed to initialize the parser as follows: let pdfParser = new PDFParser(null, 1).
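For context, this is roughly how that initialization fits into loading a file, as a sketch under a few assumptions. The function name loadPages and its callback parameters are hypothetical; the PDFParser constructor is passed in as an argument so the snippet does not depend on how pdf2json is imported. The event names pdfParser_dataReady and pdfParser_dataError and the loadPDF method are the ones pdf2json uses. The fallback to pdfData.Pages is an assumption for releases where the formImage wrapper is not present.

```javascript
// Sketch only: PDFParser is injected by the caller, e.g. require("pdf2json").
function loadPages(PDFParser, pdfPath, onPages, onError) {
  const pdfParser = new PDFParser(null, 1); // same arguments as in the answer

  pdfParser.on("pdfParser_dataError", (errData) => onError(errData.parserError));

  pdfParser.on("pdfParser_dataReady", (pdfData) => {
    // The answer reads pdfData.formImage.Pages; falling back to pdfData.Pages
    // is an assumption for versions without the formImage wrapper.
    const root = pdfData.formImage || pdfData;
    onPages(root.Pages || []);
  });

  pdfParser.loadPDF(pdfPath);
}

// Possible usage, assuming pdf2json is installed ("./input.pdf" is a placeholder):
// const PDFParser = require("pdf2json");
// loadPages(PDFParser, "./input.pdf", (pages) => console.log(pages.length), console.error);
```

Each element of the pages array then carries the Texts array whose R[0].TS[2] value holds the bold flag.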
Check this for more details.
Upvotes: 2