林抿均

Reputation: 43

How can I extract only the required parts of the text from an image, instead of extracting all of the text, using OCR?

Below are some images of transactions that I converted from PDF files to images (JPG).

Images Converted from PDF

  1. BCA Bank
  2. Maybank

Now, how can I extract the required parts of the text from the images (circled in red), as shown below, using any OCR Python package?

Parts of text wanted to extract

  1. BCA Bank
  2. Maybank

Note: The reason I convert the PDF files to images (JPG) is that some of the PDF files are scanned rather than native PDFs. The images shown above were converted from native PDF files.
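As a side note on the PDF-to-image step: here is a minimal sketch using the pdf2image package (which wraps poppler). The function name and the 300 DPI default are illustrative choices, not something from the question.

```python
# Sketch: save every page of a PDF as a JPEG, assuming the
# pdf2image package (and its poppler dependency) are installed.

def pdf_to_jpegs(pdf_path, out_prefix="page", dpi=300):
    """Save each page of `pdf_path` as `<out_prefix>_<n>.jpg` and return the paths."""
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)
    paths = []
    for n, page in enumerate(pages, start=1):
        out = f"{out_prefix}_{n}.jpg"
        page.save(out, "JPEG")
        paths.append(out)
    return paths
```

A higher DPI generally gives tesseract more pixels to work with, which is the same reason the answer below resizes the image 3x.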

Upvotes: 0

Views: 870

Answers (1)

Sparrow1029

Reputation: 642

Okay, so I would suggest censoring out the banking data in the above images (if these are real statements for companies/people...). As an experiment, I resized the image to 3x larger (hence the pixel coordinate adjustments) and did an example here with a couple of the sections.

#!/usr/bin/env python3
import pytesseract
from PIL import Image
from pprint import pprint

with Image.open("BCA_Bank.png") as img:
    img = img.resize((img.width*3, img.height*3))
    # Do a binary threshold on the image to make it solid black & white
    # the top two sections needed a different value than the bottom
    # because of font weight.
    topimg = img.convert("L").point(lambda p: 255 if p > 85 else 0).convert('1')
    bottomimg = img.convert("L").point(lambda p: 255 if p > 200 else 0).convert('1')

    sec1 = topimg.crop((29*3, 82*3, 358*3, 174*3))
    sec2 = topimg.crop((574*3, 83*3, 749*3, 163*3))
    tanggal = bottomimg.crop((41*3, 310*3, 86*3, 842*3))
    keterangan1 = bottomimg.crop((107*3, 291*3, 230*3, 757*3))
    keterangan2 = bottomimg.crop((239*3, 291*3, 388*3, 757*3))

    # This will open all 5 sections in temporary windows as image previews
    sec1.show()
    sec2.show()
    tanggal.show()
    keterangan1.show()
    keterangan2.show()

    # This could be abstracted so as not to be repetitive.
    sec1_text = pytesseract.image_to_string(sec1, config="--psm 6", lang="ind")
    sec2_text = pytesseract.image_to_string(sec2, config="--psm 6", lang="ind")
    tanggal_col = pytesseract.image_to_string(tanggal, config="--psm 4", lang="ind")
    keterangan1_col = pytesseract.image_to_string(keterangan1, config="--psm 4", lang="ind")
    keterangan2_col = pytesseract.image_to_string(keterangan2, config="--psm 4", lang="ind")

    headers = ["SEC1", "SEC2", "TANGGAL", "KETERANGAN1", "KETERANGAN2"]  #"CBG", "MUTASI", "SALDO"]
    col_data = [
        sec1_text.strip().splitlines(),
        sec2_text.strip().splitlines()
    ] + [
        [line.strip() for line in col.splitlines() if line]
        for col in [tanggal_col, keterangan1_col, keterangan2_col]
    ]
    pprint(dict(zip(headers, col_data)))

This will only work if all of the sheets have almost exactly the same size and structure, since the .crop calls select fixed pixel boxes from the image.
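The repetition noted in the code comment could be factored out roughly like this. The region table, the `scaled` helper, and the `ocr_regions` function are my own sketch, reusing the coordinates and --psm values from the code above:

```python
SCALE = 3

def scaled(box, scale=SCALE):
    """Multiply every coordinate of a (left, top, right, bottom) box."""
    return tuple(c * scale for c in box)

# (box at original resolution, tesseract page-segmentation mode)
REGIONS = {
    "SEC1":        (( 29,  82, 358, 174), "--psm 6"),
    "SEC2":        ((574,  83, 749, 163), "--psm 6"),
    "TANGGAL":     (( 41, 310,  86, 842), "--psm 4"),
    "KETERANGAN1": ((107, 291, 230, 757), "--psm 4"),
    "KETERANGAN2": ((239, 291, 388, 757), "--psm 4"),
}

def ocr_regions(img, lang="ind"):
    """OCR each named region of an already-resized, thresholded PIL image."""
    import pytesseract
    out = {}
    for name, (box, psm) in REGIONS.items():
        crop = img.crop(scaled(box))
        text = pytesseract.image_to_string(crop, config=psm, lang=lang)
        out[name] = [line.strip() for line in text.splitlines() if line.strip()]
    return out
```

Note that the original uses two different binary thresholds (topimg vs bottomimg); to keep that, the threshold value could be carried as a third entry in the region table.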

I'm not great with pandas/numpy, so I'm sure there's a more pythonic approach to structuring the parsed data from tesseract using one or both of those libraries.
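For what it's worth, one possible pandas shape: zip the per-column line lists into a DataFrame, padding ragged columns with empty strings. The `columns_to_frame` helper and the sample rows are made up for illustration.

```python
import pandas as pd

def columns_to_frame(headers, col_data):
    """Turn per-column OCR line lists into a DataFrame, padding short columns."""
    longest = max(len(col) for col in col_data)
    padded = {h: col + [""] * (longest - len(col))
              for h, col in zip(headers, col_data)}
    return pd.DataFrame(padded)

frame = columns_to_frame(
    ["TANGGAL", "KETERANGAN", "MUTASI"],
    [["01/03", "02/03"], ["SALDO AWAL", "TRSF E-BANKING"], ["1.000.000"]],
)
```

The padding matters because tesseract often returns a different number of lines per column.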

Note regarding the --psm options:

  • --psm 4 means 'Assume a single column of text of variable sizes.'
  • --psm 6 means 'Assume a single uniform block of text.'

In a terminal, type tesseract --help-psm for more info.

Upvotes: 3
