Akshay Kumar

Reputation: 1035

How to parse a PDF in nodejs

I am trying to parse a PDF and categorize its content based on text formatting/decoration. How would you suggest I do that? For example, I have a PDF in which the following structure is repeated: S.No., BOLD+UNDERLINED TITLE, paragraph.

How do I categorize this data into an array of objects based on text decoration:

[ 
  { sno: "", title: "", desc: "" }, 
  ... 
]

Upvotes: 1

Views: 2471

Answers (1)

Akshay Kumar

Reputation: 1035

I went through the pdf2json documentation and figured out that, after parsing the PDF, I would have to use the pdfData.formImage.Pages[pageNumber].Texts[wordNumber].R[0] object to get hold of the values I need.

The TS property of that object is an array; the value at TS[2] indicates whether the text is bold (value = 1) or not (value = 0). I could not find any details on how underline text decoration is represented.

I also needed to initialize the parser as follows: let pdfParser = new PDFParser(null, 1).
Check the pdf2json documentation for more details.
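
As a rough sketch of how the bold flag could be used to build the array from the question: the function below walks the Pages array and sorts each text run into sno, title or desc. It is not taken from pdf2json itself. The function name groupByBold and the SERIAL_NUMBER pattern are hypothetical, and it assumes that the serial number is a non-bold run such as "1." or "12", that every bold run belongs to a title, and that the run text lives URI-encoded in R[0].T. Underline is ignored, since no underline flag was found.

```javascript
// Sketch only: group pdf2json text runs into { sno, title, desc } records.
// Assumptions: serial numbers are non-bold runs like "1." or "12", and
// every bold run is part of a title. Adjust both to the actual PDF.
const SERIAL_NUMBER = /^\d+[.)]?$/; // hypothetical pattern

function groupByBold(pages) {
  const records = [];
  let current = null;

  for (const page of pages) {
    for (const textItem of page.Texts) {
      const run = textItem.R[0];
      // Assumed: run.T holds the URI-encoded text of the run.
      const text = decodeURIComponent(run.T).trim();
      if (text === "") continue;
      const isBold = run.TS[2] === 1; // TS[2]: 1 = bold, 0 = not bold

      if (!isBold && SERIAL_NUMBER.test(text)) {
        // A serial number starts a new record.
        current = { sno: text, title: "", desc: "" };
        records.push(current);
      } else if (current === null) {
        // Text before the first serial number is skipped.
        continue;
      } else if (isBold) {
        current.title = current.title ? `${current.title} ${text}` : text;
      } else {
        current.desc = current.desc ? `${current.desc} ${text}` : text;
      }
    }
  }
  return records;
}

// Possible wiring with pdf2json (file path is a placeholder):
//   const PDFParser = require("pdf2json");
//   const pdfParser = new PDFParser(null, 1);
//   pdfParser.on("pdfParser_dataReady", (pdfData) => {
//     console.log(groupByBold(pdfData.formImage.Pages));
//   });
//   pdfParser.loadPDF("./input.pdf");
```

A number that appears on its own inside a paragraph would be mistaken for a new serial number by this pattern, so the check may need tightening (for example by also comparing the x position of the run) for real documents.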

Upvotes: 2
