Reputation: 53
I have been facing a problem for a while. I am working on a project where I have to build a REST API that extracts text from a PDF and produces JSON data from it. The PDF format is the same every time, and I got it working with tabula-py,
but I am getting broken text like this:
"িনবেনর তািরখ
which should be this: নিবন্ধনের তারিখ ("registration date").
I don't know what the problem is. I tried different libraries (PyPDF2, pdfminer, etc.) but got the same result. I can't use OCR models because the REST API has to convert PDFs of at least 400-500 pages on every single request, and OCR would take forever.
If anyone knows how to solve this, please help.
Here is my code:
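For what it's worth, inspecting the first characters of the broken string shows the i-vowel sign (U+09BF) stored before its consonant, which is not valid Unicode ordering, so the extractor seems to be returning glyphs in visual order rather than logical order:

```python
import unicodedata

# Garbled output copied from the extractor
broken = "িনবেনর তািরখ"

# In well-formed Bengali Unicode, a dependent vowel sign like the
# i-kar (U+09BF) must FOLLOW its consonant; here it comes first.
for ch in broken[:2]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+09BF BENGALI VOWEL SIGN I
# U+09A8 BENGALI LETTER NA
```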
import tabula
import json

# Path to the PDF file
pdf_path = "small.pdf"

# Use tabula to extract tables from the PDF
# ('sujata' is not a registered Python codec, so it raises LookupError;
# utf-8 is tabula-py's default)
tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True, encoding='utf-8')

# Initialize an empty list to store the JSON objects
json_data = []

# Process each table: convert it to records and append them to the list
for table in tables:
    json_data.extend(table.to_dict(orient="records"))

# Clean up the dictionary entries
cleaned_data = []
for entry in json_data:
    cleaned_entry = {}
    for key, value in entry.items():
        cleaned_key = key.replace("\r", " ").strip()
        cleaned_value = value.replace("\r", " ").strip() if isinstance(value, str) else value
        cleaned_entry[cleaned_key] = cleaned_value
    cleaned_data.append(cleaned_entry)

def remove_unnamed_keys(data):
    # Drop pandas' auto-generated "Unnamed: N" columns and None values
    cleaned = []
    for entry in data:
        cleaned.append({k: v for k, v in entry.items() if not k.startswith("Unnamed") and v is not None})
    return cleaned

cleaned_data = remove_unnamed_keys(cleaned_data)

# Write the cleaned data to output.json
with open("output.json", "w", encoding="utf-8") as json_file:
    json.dump(cleaned_data, json_file, ensure_ascii=False, indent=4)
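In case it helps, this is what the cleanup steps above do to a made-up record (the field names here are invented purely for illustration):

```python
# Hypothetical record mimicking what tabula/pandas produces:
# carriage returns inside cells and auto-generated "Unnamed" columns.
entry = {"Field\rName": " value\r1 ", "Unnamed: 0": None, "Unnamed: 1": "x"}

# Step 1: strip \r and surrounding whitespace from keys and string values
cleaned = {}
for key, value in entry.items():
    ckey = key.replace("\r", " ").strip()
    cval = value.replace("\r", " ").strip() if isinstance(value, str) else value
    cleaned[ckey] = cval

# Step 2: drop "Unnamed" columns and None values
cleaned = {k: v for k, v in cleaned.items() if not k.startswith("Unnamed") and v is not None}
print(cleaned)  # {'Field Name': 'value 1'}
```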
Upvotes: 0
Views: 92