Roman K.C.

Reputation: 49

How to extract mathematical expressions from a PDF using Python?

I have a PDF that contains mathematical equations.

I am trying to extract the objective questions from a PDF file and convert them into a CSV file using Python, so that each row contains a question, its four options (one per column), and the correct option, for six columns in total. However, the PDF also contains mathematical equations, and I can't write them into the CSV file as they appear. Is it possible to write those equations into my CSV file exactly as they appear in the PDF?

Upvotes: 2

Views: 6560

Answers (1)

Maksym Polshcha

Reputation: 18358

This depends on how the formula is represented in the PDF. It can be an XObject, an inline image, or Unicode text.

Try pdfreader. It can extract plain text, text containing PDF commands, and images from PDF documents.

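from pdfreader import SimplePDFViewer, PageDoesNotExist

# your_pdf_file_name is a placeholder for the path to your PDF
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""    # decoded text strings only
pdf_markdown = ""  # text together with the PDF commands around it
images = []        # inline images and image XObjects
try:
    while True:
        viewer.render()  # parse the current page
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        images.extend(viewer.canvas.inline_images)
        images.extend(viewer.canvas.images.values())
        viewer.next()  # raises PageDoesNotExist after the last page
except PageDoesNotExist:
    pass
finally:
    fd.close()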
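
If the formulas turn out to be stored as Unicode text, they can go into the CSV like any other string, as long as the file is written with a Unicode encoding. Below is a minimal sketch using only the standard csv module. The column names, the sample row and the helper names write_questions and read_questions are made up for illustration and do not come from the question or from pdfreader. Formulas stored as images or XObjects cannot be kept this way, because a CSV cell only holds text.

```python
import csv

# Hypothetical column names: question, four options, correct option.
HEADER = ["question", "option_a", "option_b", "option_c", "option_d", "correct"]

# Hypothetical sample row containing Unicode math symbols.
ROWS = [
    ["What is ∫ x² dx?", "x³/3 + C", "2x + C", "x² + C", "x³ + C", "A"],
]


def write_questions(path, rows):
    """Write the header and the six-column rows to a UTF-8 CSV file."""
    # utf-8-sig prepends a BOM, which helps Excel detect the encoding.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_questions(path):
    """Read the CSV back as a list of rows (header included)."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.reader(f))
```

Calling write_questions("questions.csv", ROWS) and then read_questions("questions.csv") should give back the same strings, symbols included. In practice the rows would be built by splitting plain_text from the snippet above into questions and options, which depends on the layout of the particular PDF.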

Upvotes: 1
