Muhammad Irfan Ali

Reputation: 135

Extract text from PDF (Table of Contents) Ignoring page and indexing numbers

I am working on extracting text from a PDF and saving it in a .csv file. The image below shows the text I am trying to extract from the PDF:

[image: the table of contents in the PDF]

Currently, I can extract the text, but I can't get rid of the page numbers and index numbers (i.e., the numbers at the start and end of each entry: 1, 5, 1.1, 5, 1.2, etc.). Below is my working code (I am using Python 3.5):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages = maxpages, password = password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()
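    # Flatten line breaks, dashes, underscores and tabs to single spaces.
    for old in ('\n\n', '\n', '–', '_', '\t'):
        text = text.replace(old, ' ')
    # Turn any remaining non-ASCII character into '?', which is removed below.
    text = text.encode('ascii', errors='replace').decode('utf-8')
    # Strip punctuation and control characters (this removes every dot,
    # including the dot leaders and the dots inside index numbers).
    for old in ('?', '\x0c', '.', '\\', '/', '\r'):
        text = text.replace(old, '')
    text = text.replace('-', ' ')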
    text = " ".join(text.split())
    fp.close()
    device.close()
    retstr.close()
    return text

content = convert_pdf_to_txt('filename.pdf')

#print (content.encode('utf-8'))
s = StringIO(content)
with open('output.csv', 'w') as f:
    for line in s:
        f.write(line)

Thanks in advance for the help.

Upvotes: 3

Views: 4368

Answers (2)

dalanicolai

Reputation: 347

The pdfminer documentation here shows how to do it in section 2.4.

For the record I'll copy-paste the relevant code here.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

password = ''  # empty string for a PDF that is not password-protected

# Open a PDF document.
with open('mypdf.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser, password)
    # Get the outlines of the document.
    outlines = document.get_outlines()
    for (level, title, dest, a, se) in outlines:
        # Drop the first word of each title (the index number, e.g. "1.1").
        print(' '.join(title.split(' ')[1:]))

I adapted the print statement to drop the first word of each title (the index number), which is what the question asks for.
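
Note that `title.split(' ')[1:]` drops the first word of every title, even when a title has no leading number. A slightly more defensive variant, sketched below, removes the first token only when it looks like a section number. The helper name `strip_index` and the regular expression are illustrative assumptions, not part of pdfminer:

```python
import re

# Matches a leading section number such as "1", "1.2" or "3.4.5."
# followed by whitespace.
_INDEX_RE = re.compile(r'^\s*\d+(?:\.\d+)*\.?\s+')


def strip_index(title):
    """Return the title without a leading section number, if it has one."""
    return _INDEX_RE.sub('', title).strip()


if __name__ == '__main__':
    for raw in ['1 Introduction', '1.2 Scope of work', 'Appendix']:
        print(strip_index(raw))
```

Inside the loop over the outlines you could then call `print(strip_index(title))`. A title that genuinely starts with a number followed by a space (a year, for example) would lose that number too, so adjust the pattern if your document has such entries.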

Upvotes: 3

Denny

Reputation: 177

You can extract the TOC with mutool:

mutool show your.pdf outline > toc.txt

Then convert the contents of toc.txt to a CSV file.
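
One way to do that conversion is sketched below. It assumes that each line of toc.txt holds tab-separated fields, with the title in one field and the link destination in a field starting with `#`; the exact layout depends on the mutool version, so inspect your own toc.txt first. The function names `outline_line_to_title` and `toc_to_csv` are hypothetical:

```python
import csv
import re


def outline_line_to_title(line):
    """Extract the title from one outline line (assumed tab-separated)."""
    fields = [f.strip() for f in line.split('\t')]
    # Keep fields that are not empty, not a destination ("#..."),
    # and not a one-character tree marker such as "|", "+" or "-".
    kept = [f for f in fields
            if f and not f.startswith('#') and f not in ('|', '+', '-')]
    title = ' '.join(kept).strip('"')
    # Drop a leading section number such as "1" or "1.2", if present.
    return re.sub(r'^\d+(?:\.\d+)*\.?\s+', '', title)


def toc_to_csv(txt_path, csv_path):
    """Write one CSV row per non-empty outline title."""
    with open(txt_path, encoding='utf-8') as src, \
            open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.writer(dst)
        for line in src:
            title = outline_line_to_title(line)
            if title:
                writer.writerow([title])
```

Calling `toc_to_csv('toc.txt', 'output.csv')` would then produce a one-column CSV of titles with neither index numbers nor page destinations.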

I learned about mutool from this answer: Extract toc from pdf by mutool

Upvotes: 1
