Reputation: 11
I need to extract the table of contents (chapters and subtitles) from a PDF using Python. I have tried libraries such as PyPDF2 and PyMuPDF, but none of them met my requirements. I also asked ChatGPT but did not find anything relevant. I am hoping someone here can help me solve this problem.
Upvotes: -2
Views: 1143
Reputation: 111
With PyMuPDF you can extract the table of contents using Document.get_toc() (see https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_toc). However, this only works if the PDF has an embedded outline (bookmarks); without one, the call returns an empty list.
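As a minimal sketch of the get_toc() route: in its default mode it returns a list of [level, title, page] entries. The helper names read_toc and format_toc below are my own, not part of PyMuPDF, and the path "book.pdf" in the usage note is a placeholder.

```python
def format_toc(entries):
    """Render [level, title, page] entries as an indented outline."""
    lines = []
    for entry in entries:
        level, title, page = entry[:3]
        indent = "  " * (level - 1)  # level 1 = chapter, level 2 = section, ...
        lines.append(f"{indent}{title} (p. {page})")
    return "\n".join(lines)


def read_toc(pdf_path):
    """Return the embedded outline of a PDF, or [] if it has none."""
    import fitz  # PyMuPDF; imported here so format_toc works without it

    with fitz.open(pdf_path) as doc:
        return doc.get_toc()
```

Typical use would be print(format_toc(read_toc("book.pdf"))). If that prints nothing, the PDF has no embedded outline and you need a heuristic instead.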
I have found that a simple algorithm based on text size is surprisingly good at finding titles on pages. Simply looking for the largest text on the first page often returns the title, at least in my collection of PDFs. There are lots of cases where it does not work, but for my collection it works most of the time.
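A rough sketch of that "largest text on the first page" idea, assuming PyMuPDF's page.get_text("dict") output, where each text span carries a "size" and a "text" field. The function names are hypothetical, and this is only one way to do it.

```python
def largest_text(spans):
    """Given (size, text) pairs, return the text set in the largest font size."""
    spans = [(size, text.strip()) for size, text in spans if text.strip()]
    if not spans:
        return ""
    biggest = max(size for size, _ in spans)
    # A title may be split across several spans, so join all of them.
    return " ".join(text for size, text in spans if size == biggest)


def page_spans(page):
    """Yield (size, text) for every text span on a PyMuPDF page."""
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                yield span["size"], span["text"]


def guess_title(pdf_path):
    """Guess the document title from the first page."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return largest_text(page_spans(doc[0]))
```

This will pick up a large drop cap, logo text, or page banner just as happily as a title, which is one of the ways it fails.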
If your PDF formatting is consistent, you can use a similar idea to find chapter headings. Find the mode of the text size in the document; this will most likely be the normal body text. Text larger than this is likely to be your headings, subheadings, etc.
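Here is a sketch of the mode-based approach under a few assumptions of mine: sizes are rounded to 0.1 pt, the mode is weighted by character count (so many short headings cannot outvote the body text), and a span counts as a heading only if it is more than a hypothetical margin of 0.5 pt larger than the body size. The function names are made up for illustration.

```python
from collections import Counter


def find_headings(spans, margin=0.5):
    """Given (size, text, page) spans, return those larger than the body text."""
    spans = [(round(s, 1), t.strip(), p) for s, t, p in spans if t.strip()]
    if not spans:
        return []
    # Most common font size, weighted by how many characters use it.
    weights = Counter()
    for size, text, _ in spans:
        weights[size] += len(text)
    body_size = weights.most_common(1)[0][0]
    return [(s, t, p) for s, t, p in spans if s > body_size + margin]


def document_spans(pdf_path):
    """Yield (size, text, page_number) for every text span in the PDF."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # skip image blocks
                    for span in line["spans"]:
                        yield span["size"], span["text"], page_number
```

Calling find_headings(document_spans("book.pdf")) (placeholder path) gives candidate headings in reading order. Sorting the distinct sizes in the result from largest to smallest gives a rough chapter / subheading level for each one.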
If you give code examples of what you tried, you are more likely to get specific responses.
Upvotes: 2