Extract text from PDF (Table of Contents) Ignoring page and indexing numbers

Question

I am extracting text from a PDF and saving it to a .csv file. The image below shows the text I am trying to extract from the PDF:

Currently, I can extract the text, but I can't get rid of the page numbers and index numbers (i.e., the numbers at the start and end of each entry: 1, 5, 1.1, 5, 1.2, etc.). Below is my working code (I am using Python 3.5):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages = maxpages, password = password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()
    text = text.replace('\n\n', ' ').replace('\n',' ').replace('–',' ').replace('_',' ').replace('\t',' ').encode('ascii', errors='replace').decode('utf-8').replace("?","").replace("\x0c","").replace(".","").replace('\\',"").replace('/',"").replace('\n',"").replace("-"," ").replace(".......*"," ")
    text = " ".join(text.split())
    fp.close()
    device.close()
    retstr.close()
    return text

content = convert_pdf_to_txt('filename.pdf')

#print (content.encode('utf-8'))
s = StringIO(content)
with open('output.csv', 'w') as f:
    for line in s:
        f.write(line)

Thanks in advance for the help.

dalanicolai · Accepted Answer

The pdfminer documentation shows how to do this in section 2.4.

For the record I'll copy-paste the relevant code here.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

password = ''  # empty string for an unencrypted PDF

# Open a PDF document.
with open('mypdf.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser, password)
    # Get the outlines of the document.
    outlines = document.get_outlines()
    for (level, title, dest, a, se) in outlines:
        # Drop the first space-separated token (the index number, e.g. "1.1").
        print(' '.join(title.split(' ')[1:]))

The print statement was adapted to answer the question: it drops the first space-separated token of each title, which is the index number.
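If the titles still carry numbers (or you are cleaning lines from raw extracted text rather than from the outlines), a regex pass before writing the CSV might help. This is only a rough sketch: `clean_toc_title` and `write_titles_csv` are names I made up, the sample titles are hypothetical, and the patterns assume an index like `1` or `1.2` at the start and a page number at the end. A title that legitimately ends in a number (e.g. "Python 3") would lose it, so adjust for your document.

```python
import csv
import io
import re

# Leading index such as "1", "1.2" or "1.2.3." followed by whitespace.
LEADING_INDEX = re.compile(r'^\s*\d+(?:\.\d+)*\.?\s+')
# Trailing page number, optionally preceded by dot leaders ("..... 5").
TRAILING_PAGE = re.compile(r'[\s.]*\s\d+\s*$')


def clean_toc_title(title):
    """Remove a leading index number and a trailing page number, if present."""
    title = LEADING_INDEX.sub('', title)
    title = TRAILING_PAGE.sub('', title)
    return title.strip()


def write_titles_csv(titles, out):
    """Write one cleaned title per CSV row to the file-like object `out`."""
    writer = csv.writer(out)
    for title in titles:
        writer.writerow([clean_toc_title(title)])


if __name__ == '__main__':
    # Hypothetical sample entries, shaped like the ones in the question.
    sample = ['1 Introduction 5', '1.1 Background ..... 5', '1.2 Scope 6']
    # Swap the buffer for open('output.csv', 'w', newline='') to write a file.
    buffer = io.StringIO()
    write_titles_csv(sample, buffer)
    print(buffer.getvalue())
```

To combine this with the outline code, collect each `title` from `document.get_outlines()` into a list and pass that list to `write_titles_csv`.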
