林抿均

Reputation: 43

How can I extract only the required parts of the text from an image, instead of extracting all of the text, using OCR?

Below are some images of transactions that I converted from PDF files to images (JPG).

Images Converted from PDF

  1. BCA Bank
  2. Maybank

Now, how can I extract the required parts of the text from the images (circled in red), as shown below, using any OCR Python package?

Parts of text wanted to extract

  1. BCA Bank
  2. Maybank

Note: The reason I convert the PDF files to images (JPG) is that some of the PDF files are scanned rather than native PDFs. The images shown above were converted from native PDF files.
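As a side note on the PDF-to-image step: here is a minimal sketch using the pdf2image package (which wraps poppler). The function name and the 300 DPI default are illustrative choices, not something from the question.

```python
# Sketch: save every page of a PDF as a JPEG, assuming the
# pdf2image package (and its poppler dependency) are installed.

def pdf_to_jpegs(pdf_path, out_prefix="page", dpi=300):
    """Save each page of `pdf_path` as `<out_prefix>_<n>.jpg` and return the paths."""
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)
    paths = []
    for n, page in enumerate(pages, start=1):
        out = f"{out_prefix}_{n}.jpg"
        page.save(out, "JPEG")
        paths.append(out)
    return paths
```

A higher DPI generally gives tesseract more pixels to work with, which is the same reason the answer below resizes the image 3x.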

Upvotes: 0

Views: 870

Answers (1)

Sparrow1029

Reputation: 642

Okay, so I would suggest censoring out the banking data in the above images (if these are real statements for companies/people...). As an experiment, I resized the image to 3x larger (hence the pixel coordinate adjustments) and did an example here with a couple of the sections.

#!/usr/bin/env python3
import pytesseract
from PIL import Image
from pprint import pprint

with Image.open("BCA_Bank.png") as img:
    img = img.resize((img.width*3, img.height*3))
    # Do a binary threshold on the image to make it solid black & white
    # the top two sections needed a different value than the bottom
    # because of font weight.
    topimg = img.convert("L").point(lambda p: 255 if p > 85 else 0).convert('1')
    bottomimg = img.convert("L").point(lambda p: 255 if p > 200 else 0).convert('1')

    sec1 = topimg.crop((29*3, 82*3, 358*3, 174*3))
    sec2 = topimg.crop((574*3, 83*3, 749*3, 163*3))
    tanggal = bottomimg.crop((41*3, 310*3, 86*3, 842*3))
    keterangan1 = bottomimg.crop((107*3, 291*3, 230*3, 757*3))
    keterangan2 = bottomimg.crop((239*3, 291*3, 388*3, 757*3))

    # This will open all 5 sections in temporary windows as image previews
    sec1.show()
    sec2.show()
    tanggal.show()
    keterangan1.show()
    keterangan2.show()

    # This could be abstracted so as not to be repetitive.
    sec1_text = pytesseract.image_to_string(sec1, config="--psm 6", lang="ind")
    sec2_text = pytesseract.image_to_string(sec2, config="--psm 6", lang="ind")
    tanggal_col = pytesseract.image_to_string(tanggal, config="--psm 4", lang="ind")
    keterangan1_col = pytesseract.image_to_string(keterangan1, config="--psm 4", lang="ind")
    keterangan2_col = pytesseract.image_to_string(keterangan2, config="--psm 4", lang="ind")

    headers = ["SEC1", "SEC2", "TANGGAL", "KETERANGAN1", "KETERANGAN2"]  #"CBG", "MUTASI", "SALDO"]
    col_data = [
        sec1_text.strip().splitlines(),
        sec2_text.strip().splitlines()
    ] + [
        [line.strip() for line in col.splitlines() if line]
        for col in [tanggal_col, keterangan1_col, keterangan2_col]
    ]
    pprint(dict(zip(headers, col_data)))

This will only work if all of the sheets have almost exactly the same size and structure, since the .crop calls select fixed pixel boxes from the image.
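The repetition noted in the code comment could be factored out roughly like this. The region table, the `scaled` helper, and the `ocr_regions` function are my own sketch, reusing the coordinates and --psm values from the code above:

```python
SCALE = 3

def scaled(box, scale=SCALE):
    """Multiply every coordinate of a (left, top, right, bottom) box."""
    return tuple(c * scale for c in box)

# (box at original resolution, tesseract page-segmentation mode)
REGIONS = {
    "SEC1":        (( 29,  82, 358, 174), "--psm 6"),
    "SEC2":        ((574,  83, 749, 163), "--psm 6"),
    "TANGGAL":     (( 41, 310,  86, 842), "--psm 4"),
    "KETERANGAN1": ((107, 291, 230, 757), "--psm 4"),
    "KETERANGAN2": ((239, 291, 388, 757), "--psm 4"),
}

def ocr_regions(img, lang="ind"):
    """OCR each named region of an already-resized, thresholded PIL image."""
    import pytesseract
    out = {}
    for name, (box, psm) in REGIONS.items():
        crop = img.crop(scaled(box))
        text = pytesseract.image_to_string(crop, config=psm, lang=lang)
        out[name] = [line.strip() for line in text.splitlines() if line.strip()]
    return out
```

Note that the original uses two different binary thresholds (topimg vs bottomimg); to keep that, the threshold value could be carried as a third entry in the region table.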

I'm not great with pandas/numpy, so I'm sure there's a more pythonic approach to structuring the parsed data from tesseract using one or both of those libraries.
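For what it's worth, one possible pandas shape: zip the per-column line lists into a DataFrame, padding ragged columns with empty strings. The `columns_to_frame` helper and the sample rows are made up for illustration.

```python
import pandas as pd

def columns_to_frame(headers, col_data):
    """Turn per-column OCR line lists into a DataFrame, padding short columns."""
    longest = max(len(col) for col in col_data)
    padded = {h: col + [""] * (longest - len(col))
              for h, col in zip(headers, col_data)}
    return pd.DataFrame(padded)

frame = columns_to_frame(
    ["TANGGAL", "KETERANGAN", "MUTASI"],
    [["01/03", "02/03"], ["SALDO AWAL", "TRSF E-BANKING"], ["1.000.000"]],
)
```

The padding matters because tesseract often returns a different number of lines per column.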

Note regarding the --psm options:

  • --psm 4 means 'Assume a single column of text of variable sizes.'
  • --psm 6 means 'Assume a single uniform block of text.'

In a terminal, type tesseract --help-psm for more info.

Upvotes: 3
