Reputation: 69
I am trying to write a Python script to parse a PDF file using PyPDF2. The catch is that my PDF isn't a traditional document; it's an engineering drawing.
I need the code to extract the text in the bottom-right corner, as well as the text on a red stamp. The drawing looks something like this: (image not available)
I tried to write some basic code to parse it and extract the data, but it's not working.
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Does anyone have any recommendations?
Upvotes: 0
Views: 2631
Reputation: 11914
What you need is to rescan the original. As in all such cases, it's rubbish in = rubbish out.
Never scan to a lossy format; it adds too much noise to the background. Scan afresh as TIFF, PGM, or PNG in greyscale.
Ensure the resolution is high enough that the smallest letters are a couple of dozen pixels high.
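To turn that rule of thumb into a scanner setting, here is a rough sketch of the arithmetic. It assumes "a couple of dozen pixels" means about 24, and the 2.5 mm lettering height is only an illustrative value; measure the smallest text on your own drawing.

```python
MM_PER_INCH = 25.4


def required_dpi(letter_height_mm, target_pixels=24):
    """Scan resolution at which a letter of the given physical height
    spans `target_pixels` pixels (24 is an assumed reading of
    "a couple of dozen")."""
    return target_pixels * MM_PER_INCH / letter_height_mm


# Example: 2.5 mm lettering (an assumed value) needs roughly 244 dpi,
# so the next common scanner setting up (300 dpi) would be a sensible choice.
print(round(required_dpi(2.5)))
```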
With that low quality image this is about as good as can be expected:
Upvotes: 0
Reputation: 147
You can try the pdfplumber library instead. It exposes the position of each character on the page and lets you crop a page to a region before extracting text.
Link: https://github.com/jsvine/pdfplumber
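A minimal sketch of how that might look for a title block in the bottom-right corner. The 40% by 25% region is a guess, not something taken from the drawing, and the helper names are made up for illustration. It also assumes the page origin is at (0, 0) and that the PDF contains a real text layer; a scanned drawing will return nothing and would need OCR instead.

```python
def bottom_right_bbox(width, height, frac_w=0.4, frac_h=0.25):
    """Return (x0, top, x1, bottom) for the bottom-right part of a page.

    pdfplumber measures from the top-left corner, so the bottom edge
    of the page is at y == height. The fractions are assumptions.
    """
    x0 = width * (1 - frac_w)
    top = height * (1 - frac_h)
    return (x0, top, width, height)


def extract_title_block(pdf_path, page_index=0):
    """Crop the page to the bottom-right region and return its text."""
    import pdfplumber  # third-party; imported lazily so the helper above stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        region = page.crop(bottom_right_bbox(page.width, page.height))
        # extract_text() can return None when the region has no text
        return region.extract_text() or ""


# Usage (assuming example.pdf exists):
#     print(extract_title_block("example.pdf"))
```

If the result is empty, adjust the fractions or check whether the file has any extractable text at all.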
Upvotes: 0
Reputation: 1
Specify an X and Y range in a visitor function; see https://pypdf2.readthedocs.io/en/3.0.0/user/extract-text.html.
See the example below:
from PyPDF2 import PdfReader
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]
parts = []
def visitor_body(text, cm, tm, fontDict, fontSize):
    # tm[5] is the y position from the text matrix
    y = tm[5]
    if 50 < y < 720:
        parts.append(text)
page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)
print(text_body)
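The example above filters on Y only. A sketch that restricts both X and Y, which is closer to picking out a corner of a drawing, could look like the following. The helper names and the box values are hypothetical. It reads the position from `tm[4]` and `tm[5]` as in the example above and ignores `cm`, so it assumes the page applies no extra transformation. Like any text extraction, it finds nothing in a scanned image.

```python
def make_region_visitor(x_min, x_max, y_min, y_max):
    """Return (parts, visitor); visitor keeps text whose origin is inside the box.

    Coordinates are in PDF units with the origin at the bottom-left,
    so a bottom-right corner means large x and small y.
    """
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        x, y = tm[4], tm[5]
        if x_min <= x <= x_max and y_min <= y <= y_max:
            parts.append(text)

    return parts, visitor


def extract_region(pdf_path, page_index, box):
    """Extract text from one page, limited to box = (x_min, x_max, y_min, y_max)."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    page = PdfReader(pdf_path).pages[page_index]
    parts, visitor = make_region_visitor(*box)
    page.extract_text(visitor_text=visitor)
    return "".join(parts)


# Usage (box values are placeholders; tune them to your drawing):
#     print(extract_region("example.pdf", 0, (400, 600, 0, 150)))
```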
Upvotes: 0
Reputation: 36
Late to the party...
Nonetheless, we developed a commercial product to do exactly that: Werk24. It has a simple Python client: pip install werk24
With this, your task becomes very simple. You can read the title block with a single command. Imagine you want to obtain the Designation:
from werk24 import Hook, W24AskTitleBlock
from werk24.models.techread import W24TechreadMessage
from werk24.utils import w24_read_sync
from . import get_drawing_bytes # define your own
def recv_title_block(message: W24TechreadMessage) -> None:
    """ Print the Designation
    NOTE: Other fields like Drawing ID, Material etc are
    also available.
    """
    print(message.payload_dict.get('designation'))
if __name__ == "__main__":
    # submit the request to Werk24
    w24_read_sync(
        get_drawing_bytes(),
        [Hook(
            ask=W24AskTitleBlock(),
            function=recv_title_block
        )])
For the drawing that you provided, the response will be:
"designation": {
    "captions": [
        {
            "language": "eng",
            "text": "Descr"
        }
    ],
    "values": [
        {
            "language": "eng",
            "text": "Shaft"
        }
    ]
}
NOTE: Your file is very blurry, so I created the response manually. The API requires a minimum resolution of 180 dpi (it also works with TIF and DXF files).
Upvotes: 1