Encoding Issue When Attempting to Convert Hindi Script PDF to CSV in Python

Question

I'm trying to convert a PDF file containing Hindi (Devanagari) text to a CSV file using the fitz library (PyMuPDF) in Python, but the text I extract comes out garbled, as if it were in the wrong encoding.

Here is a page from the PDF:

[Image: page from the Hindi script PDF]

I am using the following code to read in the text:

import fitz
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def extract_text_from_pdf(pdf_file):
    # Concatenate the plain text of every page in the PDF.
    text = ""
    with fitz.open(pdf_file) as pdf_document:
        for page in pdf_document:
            text += page.get_text()
    return text

def devanagari_to_roman(text):
    return transliterate(text, sanscript.DEVANAGARI, sanscript.ITRANS)

def main():
    pdf_file = 'data/agra_2010.pdf'
    extracted_text = extract_text_from_pdf(pdf_file)
    roman_text = devanagari_to_roman(extracted_text)
    print(roman_text)

if __name__ == "__main__":
    main()

But here is an example of how the Hindi text appears in the output:

òû£û™û
òû£û™û
òû£û™û
òû£û™û
ftys dk uke&
ftys dk dksM &

I get the same garbled text when I use the read_pdf function from tabula, so the problem doesn't seem to be specific to fitz. I would like to keep the Hindi script in my output rather than transliterating it. Please let me know if you have any solutions. Thanks!
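
Since two different libraries return the same text, it may help to check whether the extracted string contains any Devanagari code points at all. Output such as "ftys dk uke&" is plain Latin text, which is what a PDF typeset in a legacy, non-Unicode Hindi font (possibly a Kruti Dev-style font) would typically yield. That is only a guess about this file. The sketch below is a minimal diagnostic under that assumption; the helper name devanagari_ratio is hypothetical and not part of fitz or tabula.

```python
def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters that fall in the
    Unicode Devanagari block (U+0900 to U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097F")
    return hits / len(chars)


# A line taken from the garbled output above: no Devanagari code points.
print(devanagari_ratio("ftys dk uke&"))   # 0.0

# Genuine Unicode Hindi text, for comparison.
print(devanagari_ratio("जिले का नाम"))     # 1.0
```

If the ratio for the text returned by page.get_text() is close to zero, the PDF probably does not store Unicode Devanagari, and no change of codec on the Python side would recover the script. The text would instead need a font-specific mapping to Unicode, or OCR of the rendered pages.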
