Line Ordering Issue with Arabic PDF Text Using Google Cloud Document AI

Question

I have an app that uses Document AI to process PDFs and extract their text. Even with the stable version, the output is still not accurate: the lines of the extracted text come back mixed up and do not follow the order of the original document. This is critical for my application, which relies on accurate text extraction for further processing.

I tried reading the file directly instead of sending it as a buffer, and I followed the quick-start guides Google provides. Here is the code I wrote:

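  // Method on the service class. `this.client` is presumably a
  // DocumentProcessorServiceClient from "@google-cloud/documentai";
  // projectId, location, and processorId are defined elsewhere.
  async processDocument(fileContent: Buffer | undefined, contentType: string) {
    if (!fileContent) {
      throw new Error("No file content to process");
    }

    const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

    // rawDocument.content carries the file bytes as a base64 string.
    const encodedFile = fileContent.toString("base64");

    const request = {
      name,
      rawDocument: {
        content: encodedFile,
        mimeType: contentType,
      },
    };

    // Errors from the API call propagate to the caller unchanged.
    const [result] = await this.client.processDocument(request);
    const { document } = result;

    return document;
  }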

Has anyone faced the same issue, or does anyone have an idea what the problem could be? In the Google Cloud console, testing the processor seems to work fine: it returns a different response, with the lines in the correct order.
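
One check that might help narrow this down is to compare document.text with the per-line layout data in the same response. Below is a minimal sketch, not a confirmed fix. It assumes the returned document has pages[].lines[] entries whose layout carries a textAnchor (offsets into document.text) and a boundingPoly with normalizedVertices. The type names and the helper functions (DocumentLike, layoutText, topOf, linesTopToBottom) are hypothetical and written only for this illustration. Sorting by the top edge of each line is a simple heuristic that assumes a single-column page, so it may not hold for multi-column layouts.

```typescript
// Minimal structural types covering only the response fields used below.
// These type names are hypothetical; they are not the client library's own types.
interface TextSegment {
  // int64 offsets may arrive as strings, and a zero startIndex may be omitted.
  startIndex?: number | string | null;
  endIndex?: number | string | null;
}
interface Vertex { x?: number | null; y?: number | null; }
interface Layout {
  textAnchor?: { textSegments?: TextSegment[] | null } | null;
  boundingPoly?: { normalizedVertices?: Vertex[] | null } | null;
}
interface Page { lines?: { layout?: Layout | null }[] | null; }
interface DocumentLike { text?: string | null; pages?: Page[] | null; }

// Resolve a layout's text anchor against document.text.
function layoutText(doc: DocumentLike, layout: Layout | null | undefined): string {
  const text = doc.text ?? "";
  const segments = layout?.textAnchor?.textSegments ?? [];
  return segments
    .map((s) => text.substring(Number(s.startIndex ?? 0), Number(s.endIndex ?? 0)))
    .join("");
}

// Top edge of a layout's bounding box (smallest normalized y), or 0 if absent.
function topOf(layout: Layout | null | undefined): number {
  const ys = (layout?.boundingPoly?.normalizedVertices ?? []).map((v) => v.y ?? 0);
  return ys.length > 0 ? Math.min(...ys) : 0;
}

// Collect line texts page by page, ordered top to bottom within each page.
// Assumption: single-column pages, so vertical position alone gives reading order.
function linesTopToBottom(doc: DocumentLike): string[] {
  const out: string[] = [];
  for (const page of doc.pages ?? []) {
    const lines = [...(page.lines ?? [])];
    lines.sort((a, b) => topOf(a.layout) - topOf(b.layout));
    for (const line of lines) {
      out.push(layoutText(doc, line.layout).trim());
    }
  }
  return out;
}
```

Calling linesTopToBottom(document) on the object returned by processDocument and comparing the result with document.text split on newlines should show whether the lines are already out of order in the response, or only in the way the text is read back. Passing the client's document type may need a cast, since as far as I can tell the generated types also allow Long values for the index fields.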

Answers (0)