Khaled Saleh
Reputation: 158

Line Ordering Issue with Arabic PDF Text Using Google Cloud Document AI

I have an app that uses Document AI to process PDFs and extract text from them. Even with the stable version, the result is still not accurate. The lines of the processed text come back mixed up and do not follow the order of the original document. This issue is critical for my application, because it relies on accurate text extraction for further processing.

I tried reading the file directly instead of sending it as a buffer, and I followed the quick-start guides Google provides. Here is the code I wrote:

  async processDocument(fileContent: Buffer | undefined, contentType: string) {
    try {
      const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

      const encodedFile = Buffer.from(fileContent!).toString("base64");

      const request = {
        name,
        rawDocument: {
          content: encodedFile,
          mimeType: contentType,
        },
      };

      const [result] = await this.client.processDocument(request);
      const { document } = result;

      return document;
    } catch (error) {
      throw error;
    }
  }

Has anyone faced the same issue, or does anyone have an idea what the problem could be? In the Google Cloud console, testing the same processor seems to work fine and returns a different response with the lines in the correct order.
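
To narrow down where the order goes wrong, the lines can be rebuilt from the page layout instead of being read from document.text directly. The following is a minimal sketch, not a confirmed fix. It assumes the response follows the documented Document shape (pages, lines, layout.textAnchor.textSegments, layout.boundingPoly.normalizedVertices). The local types and the helper names anchorText, topEdge and linesInVisualOrder are hypothetical and only for illustration. It sorts lines by the top edge of their bounding box, which is a simplification: it only makes sense for single-column pages and says nothing about right-to-left ordering within a line.

```typescript
// Minimal local types mirroring only the response fields used below.
// Indices are typed as unknown because int64 values may arrive as
// numbers, strings, or Long-like objects depending on the client.
interface TextSegment { startIndex?: unknown; endIndex?: unknown; }
interface Vertex { x?: number | null; y?: number | null; }
interface Layout {
  textAnchor?: { textSegments?: TextSegment[] | null } | null;
  boundingPoly?: { normalizedVertices?: Vertex[] | null } | null;
}
interface Page { lines?: { layout?: Layout | null }[] | null; }
interface DocumentLike { text?: string | null; pages?: Page[] | null; }

// Returns the part of the full text that a layout element points at.
// A missing startIndex is treated as 0.
function anchorText(fullText: string, layout?: Layout | null): string {
  const segments = layout?.textAnchor?.textSegments ?? [];
  return segments
    .map((s) =>
      fullText.substring(Number(s.startIndex ?? 0), Number(s.endIndex ?? 0))
    )
    .join("");
}

// Smallest normalized y of the bounding box, i.e. the top edge of the line.
function topEdge(layout?: Layout | null): number {
  const ys = (layout?.boundingPoly?.normalizedVertices ?? []).map(
    (v) => v.y ?? 0
  );
  return ys.length > 0 ? Math.min(...ys) : 0;
}

// Rebuilds the lines of every page, sorted top to bottom within each page.
function linesInVisualOrder(doc: DocumentLike): string[] {
  const fullText = doc.text ?? "";
  const out: string[] = [];
  for (const page of doc.pages ?? []) {
    const sorted = [...(page.lines ?? [])].sort(
      (a, b) => topEdge(a.layout) - topEdge(b.layout)
    );
    for (const line of sorted) {
      out.push(anchorText(fullText, line.layout).trim());
    }
  }
  return out;
}
```

Calling linesInVisualOrder on the document returned by processDocument (a cast to the local type may be needed) and comparing the result with document.text split on newlines should show whether the layout data is in the right order while only the flat text is shuffled.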

Upvotes: 0

Views: 84

Answers (0)