Reputation: 53
I have been facing a problem for a while. I am working on a project where I have to build a REST API that extracts text from a PDF and produces JSON data from it. The PDF format is the same every time, and I got it working with tabula-py,
but I am getting broken text like this:
"িনবেনর তািরখ
which should be this: নিবন্ধনের তারিখ ("registration date").
I don't know what the problem is. I tried different libraries (PyPDF2, pdfminer, etc.) but got the same result. I can't use OCR models because the REST API has to convert PDFs of at least 400-500 pages on every single request, and OCR would take forever.
If anyone knows how to solve this, please help.
Here is my code:
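For what it's worth, inspecting the first characters of the broken string shows the i-vowel sign (U+09BF) stored before its consonant, which is not valid Unicode ordering, so the extractor seems to be returning glyphs in visual order rather than logical order:

```python
import unicodedata

# Garbled output copied from the extractor
broken = "িনবেনর তািরখ"

# In well-formed Bengali Unicode, a dependent vowel sign like the
# i-kar (U+09BF) must FOLLOW its consonant; here it comes first.
for ch in broken[:2]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+09BF BENGALI VOWEL SIGN I
# U+09A8 BENGALI LETTER NA
```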
import tabula
import json

# Path to the PDF file
pdf_path = "small.pdf"

# Use tabula to extract tables from the PDF
# ('sujata' is not a registered Python codec, so it raises LookupError;
# utf-8 is tabula-py's default)
tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True, encoding='utf-8')

# Initialize an empty list to store the JSON objects
json_data = []

# Process each table: convert it to records and append them to the list
for table in tables:
    json_data.extend(table.to_dict(orient="records"))

# Clean up the dictionary entries
cleaned_data = []
for entry in json_data:
    cleaned_entry = {}
    for key, value in entry.items():
        cleaned_key = key.replace("\r", " ").strip()
        cleaned_value = value.replace("\r", " ").strip() if isinstance(value, str) else value
        cleaned_entry[cleaned_key] = cleaned_value
    cleaned_data.append(cleaned_entry)

def remove_unnamed_keys(data):
    # Drop pandas' auto-generated "Unnamed: N" columns and None values
    cleaned = []
    for entry in data:
        cleaned.append({k: v for k, v in entry.items() if not k.startswith("Unnamed") and v is not None})
    return cleaned

cleaned_data = remove_unnamed_keys(cleaned_data)

# Write the cleaned data to output.json
with open("output.json", "w", encoding="utf-8") as json_file:
    json.dump(cleaned_data, json_file, ensure_ascii=False, indent=4)
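In case it helps, this is what the cleanup steps above do to a made-up record (the field names here are invented purely for illustration):

```python
# Hypothetical record mimicking what tabula/pandas produces:
# carriage returns inside cells and auto-generated "Unnamed" columns.
entry = {"Field\rName": " value\r1 ", "Unnamed: 0": None, "Unnamed: 1": "x"}

# Step 1: strip \r and surrounding whitespace from keys and string values
cleaned = {}
for key, value in entry.items():
    ckey = key.replace("\r", " ").strip()
    cval = value.replace("\r", " ").strip() if isinstance(value, str) else value
    cleaned[ckey] = cval

# Step 2: drop "Unnamed" columns and None values
cleaned = {k: v for k, v in cleaned.items() if not k.startswith("Unnamed") and v is not None}
print(cleaned)  # {'Field Name': 'value 1'}
```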
Upvotes: 0
Views: 92