SonicDash829

Reputation: 69

Python PDF Parser - Engineering Drawing

I am trying to write a Python script to parse a PDF file using PyPDF2. The catch is that my PDF isn't a traditional document; it's an engineering drawing.

I need the code to extract the text written in the bottom right corner, as well as the text on a red stamp. The drawing looks something like this: [image]

I tried writing some basic code to parse it and extract the data, but it's not working.

import PyPDF2

# creating a pdf file object 
pdfFileObj = open('example.pdf', 'rb') 
  
# creating a pdf reader object 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 
  
# printing number of pages in pdf file 
print(pdfReader.numPages) 
  
# creating a page object 
pageObj = pdfReader.getPage(0) 
  
# extracting text from page 
print(pageObj.extractText()) 
  
# closing the pdf file object 
pdfFileObj.close()

Does anyone have any recommendations?

Upvotes: 0

Views: 2631

Answers (4)

K J

Reputation: 11914

What you need is to rescan the original: as in all such cases, it's Rubbish in = Rubbish out.

NEVER scan in a LOSSY format; it adds too much chatter to the background. SCAN fresh as TIFF, PGM or PNG in greyscale.

Ensure the resolution is such that the smallest letters are a couple of dozen pixels high.

With an image of that low quality, this is about as good as can be expected:

[image]
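The resolution rule of thumb above can be turned into a quick calculation. This is only a rough sketch: it assumes "a couple of dozen" means about 24 pixels and that the smallest lettering is 2.5 mm tall. Both numbers are illustrative assumptions, not values measured from the drawing in the question.

```python
MM_PER_INCH = 25.4


def required_dpi(text_height_mm: float, target_pixels: int = 24) -> float:
    """Scan resolution at which text of the given height spans target_pixels."""
    if text_height_mm <= 0:
        raise ValueError("text_height_mm must be positive")
    # pixels = dpi * (height in inches), solved for dpi
    return target_pixels * MM_PER_INCH / text_height_mm


# Hypothetical drawing whose smallest lettering is 2.5 mm tall.
dpi = required_dpi(2.5)
print(f"Scan at roughly {dpi:.0f} dpi or more")
```

Measure the smallest text on your own drawing and substitute that height; smaller lettering pushes the required resolution up proportionally.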

Upvotes: 0

Rish Rish

Reputation: 147

You can try using the pdfplumber library instead, which is a more advanced PDF parsing library that can handle different types of PDFs.

Link: https://github.com/jsvine/pdfplumber
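As a minimal sketch of how that might look for a title block, the snippet below crops the bottom-right region of a page and extracts its text. The fractions 0.6 and 0.75 and the helper names are assumptions made for illustration, so adjust them to your drawing. Note that this only returns text if the PDF actually contains a text layer; a scanned drawing would need OCR first.

```python
def bottom_right_bbox(width, height, x_frac=0.6, y_frac=0.75):
    """Return (x0, top, x1, bottom) for the bottom-right region of a page.

    pdfplumber measures `top` and `bottom` downward from the top edge,
    so the bottom-right corner has large values on both axes.
    """
    return (width * x_frac, height * y_frac, width, height)


def extract_title_block_text(pdf_path, page_index=0):
    """Return the text found in the bottom-right region, or "" if none."""
    import pdfplumber  # imported here so the helper above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        bbox = bottom_right_bbox(page.width, page.height)
        # crop() returns a page-like object restricted to the bounding box
        return page.crop(bbox).extract_text() or ""
```

Calling `extract_title_block_text("example.pdf")` would then return whatever text sits in that corner of the first page.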

Upvotes: 0

Sean

Reputation: 1

Specify an X and Y range in a visitor function; see https://pypdf2.readthedocs.io/en/3.0.0/user/extract-text.html

See the example below:

from PyPDF2 import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)
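The example above filters on Y only. A sketch that also filters on X, to target the bottom right corner, could look like the following. The helper names and the fractions 0.6 and 0.25 are assumptions for illustration. It also assumes the page origin is at the bottom-left with no rotation, and that the text matrix values are already in page coordinates, which does not hold for every PDF.

```python
def make_region_visitor(x_min, x_max, y_min, y_max, parts):
    """Build a visitor that keeps only text whose origin lies inside the box."""

    def visitor(text, cm, tm, font_dict, font_size):
        x, y = tm[4], tm[5]  # translation components of the text matrix
        if x_min <= x <= x_max and y_min <= y <= y_max:
            parts.append(text)

    return visitor


def extract_bottom_right(pdf_path, page_index=0, x_frac=0.6, y_frac=0.25):
    """Return text whose origin falls in the bottom-right part of the page."""
    from PyPDF2 import PdfReader  # same library as the example above

    page = PdfReader(pdf_path).pages[page_index]
    width = float(page.mediabox.width)
    height = float(page.mediabox.height)
    parts = []
    # PDF y coordinates grow upward, so "bottom" means small y values
    visitor = make_region_visitor(width * x_frac, width, 0, height * y_frac, parts)
    page.extract_text(visitor_text=visitor)
    return "".join(parts)
```

As with the example above, this only finds text that is stored as text in the PDF; it returns nothing for a scanned image.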

Upvotes: 0

jmattes

Reputation: 36

Late to the party...

Nonetheless, we developed a commercial product to do exactly that: Werk24. It has a simple Python client: pip install werk24

With this, your task becomes very simple. You can read the title block with a single command. Say you want to obtain the Designation:

from werk24 import Hook, W24AskTitleBlock
from werk24.models.techread import W24TechreadMessage
from werk24.utils import w24_read_sync

from . import get_drawing_bytes # define your own


def recv_title_block(message: W24TechreadMessage) -> None:
    """ Print the Designation

    NOTE: Other fields like Drawing ID, Material etc are
    also available.
    """
    print(message.payload_dict.get('designation'))


if __name__ == "__main__":

    # submit the request to Werk24
    w24_read_sync(
        get_drawing_bytes(), 
        [Hook(
          ask=W24AskTitleBlock(), 
          function=recv_title_block
        )])

For the drawing that you provided, the response will be:

"designation": {
    "captions": [
        {
            "language": "eng",
            "text": "Descr"
        }
    ],
    "values": [
        {
            "language": "eng",
            "text": "Shaft"
        }
    ]
}

NOTE: Your file is very blurry, so I created the response manually. The API requires a minimum resolution of 180 dpi (it also works with TIF and DXF files).

Upvotes: 1
