David

Reputation: 55

How can I send a PDF document to Google Cloud Document AI using the v1beta3 API?

I have successfully sent a PDF file stored in GCS to the Document AI v1beta2 API. In the v1beta3 API, however, the file-reference approach is no longer supported; the content must be sent inline in the JSON body. Here is the documentation I am following: https://cloud.google.com/document-ai/docs/form-parser#v1beta3

Some questions:

  1. What, if anything, do I have to do to the PDF content returned from a GET request? The PDF content appears to be a base64 string, which is what the API requires.

  2. Looking at the API request, do you see anything incorrect?

REQUEST INFORMATION
ID: N/A
Method: POST
URL/Path: https://us-documentai.googleapis.com/v1beta3/projects/38072577434/locations/us/processors/cd8a06d0cd3cb045:process
Headers: Content-Type: application/json, Accept: application/json
Authorization: :censored:6:c2dc31949c: :censored:179:27504afa53:
Params: N/A

Data:
{"document":{"mimeType":"application/pdf","content":["%PDF-1.4\n1 0 obj\n<<\n/Title ( [...]
  3. Here is the error I am receiving:
{
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Unknown name \"content\" at 'document': Proto field is not repeating, cannot start list.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.BadRequest",
        "fieldViolations": [
          {
            "field": "document",
            "description": "Invalid JSON payload received. Unknown name \"content\" at 'document': Proto field is not repeating, cannot start list."
          }
        ]
      }
    ]
  }
}
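
For readers hitting the same 400: the message "Proto field is not repeating, cannot start list" says that `content` arrived as a JSON list, and the Data above does show `"content":["%PDF-1.4...`, an array holding raw PDF text. The field expects one base64 string. A minimal sketch of building such a body, assuming the raw PDF bytes are already in a Node.js Buffer (`buildProcessRequest` is a hypothetical helper name, not part of any library):

```javascript
// Hypothetical helper: builds the JSON body for the :process call.
// Assumes `pdfBytes` is a Node.js Buffer holding the raw PDF bytes.
function buildProcessRequest(pdfBytes) {
  if (!Buffer.isBuffer(pdfBytes)) {
    throw new TypeError('pdfBytes must be a Buffer of raw PDF bytes');
  }
  return {
    document: {
      mimeType: 'application/pdf',
      // A single base64 string, not an array and not the raw PDF text.
      content: pdfBytes.toString('base64'),
    },
  };
}

// Stand-in bytes for illustration only; a real call would use the downloaded file.
const sampleBody = buildProcessRequest(Buffer.from('%PDF-1.4\n', 'latin1'));
console.log(JSON.stringify(sampleBody));
```

The type check is there so that a string or array passed by mistake fails early instead of being silently encoded.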

2021-01-05: adding code to show how the encoding is performed:

//
//Function to call each url in an array of urls
//
const requestAsync = function(url) {
    return z.request(url).then((response) => response.content)
}
//
//Create the array of urls to call synchronously
//
var urlArr = [];
const urls = {
  url: 'https://storage.googleapis.com/cloud-samples-data/documentai/loan_form.pdf',
  method: 'GET',
  headers: {
    'Accept': 'application/pdf',
    'raw': true
  }
}
urlArr.push(urls);
//
//Call the function for each item in the urlArr
//
return Promise.all(urlArr.map(requestAsync))
  .then(function(values) {
    //
    // Convert the file data to a Buffer and base64 encode it.
    //
    var fileContent = Buffer.from(values[0]).toString('base64');

    const options = {
      url: 'https://us-documentai.googleapis.com/v1beta3/projects/38072577434/locations/us/processors/cd8a06d0cd3cb045:process',
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Authorization': `Bearer ${bundle.authData.access_token}`
      },
      body: {
        document: {
          mimeType: 'application/pdf',
          content: fileContent
        }
      }
    };
    return z.request(options)
      .then((response) => {
        response.throwForStatus();
        const result = response.json;
        // Get all of the document text as one big string
        const {text} = result;
        // Extract shards from the text field
        const getText = textAnchor => {
          // First shard in document doesn't have startIndex property
          const startIndex = textAnchor.textSegments[0].startIndex || 0;
          const endIndex = textAnchor.textSegments[0].endIndex;
          return text.substring(startIndex, endIndex);
        };
/*      // Process the output
        const [page1] = result.pages;
        const {formFields} = page1;
        var fieldList = "";
        for (const field of formFields) {
          var fieldName = getText(field.fieldName.textAnchor);
          var fieldValue = getText(field.fieldValue.textAnchor);
          fieldName = fieldName.replace(/\n/g,'');
          fieldValue = fieldValue.replace(/\n/g,'');
          fieldList += `"${fieldName}": "${fieldValue}"`;
          z.console.log(`\t(${fieldName}, ${fieldValue})`);
        }
*/
        //z.console.log(fieldList)
        return {getText};
      });
  });

Upvotes: 0

Views: 1325

Answers (2)

Holt Skinner

Reputation: 2234

The Document AI documentation has been updated to include the base64 encoding step in the Node.js samples:

https://cloud.google.com/document-ai/docs/process-documents-client-libraries#client-libraries-usage-nodejs

You can also check out this Codelab for the Form Parser using Node.js. Most of the processing request is the same for every processor.

https://codelabs.developers.google.com/codelabs/docai-form-parser-node#7

Upvotes: 0

Ricco D

Reputation: 7287

It looks like the "content" you used in your request is not base64-encoded. If you are using Linux, you can use the base64 command:

base64 your_pdf_to_use.pdf > base64_of_your_pdf.txt

Or you can use any base64 converter. I tried an online PDF-to-base64 converter and it worked for me as well.

Base64 output should not contain any recognizable text or words. I tried the sample file from the Document AI quickstart. Here is a snippet of the base64 output:

JVBERi0xLjUKJb/3ov4KMiAwIG9iago8PCAvTGluZWFyaXplZCAxIC9MIDI5MDUxIC9IIFsgNzky IDEzNCBdIC9PIDYgL0UgMjg3NzYgL04gMSAvVCAyODc3NSA+PgplbmRvYmoKICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAKMyAwIG9iago8PCAv VHlwZSAvWFJlZiAvTGVuZ3RoIDcwIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlIC9EZWNvZGVQYXJtcyA8 PCAvQ29sdW1ucyA0IC9QcmVkaWN0b3IgMTIgPj4gL1cgWyAxIDIgMSBdIC9JbmRleCBbIDIgMzAg XSAvSW5mbyAxNyAwIFIgL1Jvb3QgNCAwIFIgL1NpemUgMzIgL1ByZXYgMjg3NzYgICAgICAgICAg ICAgICAgIC9JRCBbPGFiYjQ5MjJhYTY5N2NmZDJiODVjYjY5YjNhZGI4MDZmPjxhYmI0OTIyYWE2 OTdjZmQyYjg1Y2I2OWIzYWRiODA2Zj5dID4+CnN0cmVhbQp4nGNiZOBnYGJgOAkkmPiABKMRiNsG YjEACcHDQELhCEhWBkiICYIkpgEJ9ocgliGQEAErrmBgYpwqAdLLwEgxAQD5KwddCmVuZHN0cmVh bQplbmRvYmoKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg....
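
A quick way to check a value before sending it: a base64-encoded PDF starts with `JVBERi0`, which is the encoding of the `%PDF-` signature. The sketch below is only an illustration (`looksLikeBase64Pdf` is a hypothetical helper name); it assumes the value is a string and tolerates the line wraps that the base64 command inserts.

```javascript
// Hypothetical check: does this string decode to bytes that start
// with the PDF signature "%PDF-"?
function looksLikeBase64Pdf(value) {
  if (typeof value !== 'string') return false;
  // Drop whitespace such as the line wraps added by the base64 command.
  const compact = value.replace(/\s+/g, '');
  // Raw PDF text contains characters outside the base64 alphabet.
  if (!/^[A-Za-z0-9+\/]+={0,2}$/.test(compact)) return false;
  const decoded = Buffer.from(compact, 'base64');
  return decoded.slice(0, 5).toString('latin1') === '%PDF-';
}

console.log(looksLikeBase64Pdf('JVBERi0xLjUKJb/3ov4K')); // true
console.log(looksLikeBase64Pdf('%PDF-1.4\n1 0 obj'));     // false
```

The second call uses the kind of raw PDF text that appears in the question's request data, which is why it is rejected.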

A snippet of my request.json:

{
  "document": {
    "mimeType": "application/pdf",
    "content": "JVBERi0xLjUKJb/3ov4KMiAwIG9iago8PCAvTGluZWFyaXplZCAxIC9MIDI5MDUxIC9IIFsgNzky
IDEzNCBdIC9PIDYgL0UgMjg3NzYgL04gMSAvVCAyODc3NSA+PgplbmRvYmoKICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAKMyAwIG9iago8PCAv
VHlwZSAvWFJlZiAvTGVuZ3RoIDcwIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlIC9EZWNvZGVQYXJtcyA8
PCAvQ29sdW1ucyA0IC9QcmVkaWN0b3IgMTIgPj4gL1cgWyAxIDIgMSBdIC9JbmRleCBbIDIgMzAg
XSAvSW5mbyAxNyAwIFIgL1Jvb3QgNCAwIFIgL1NpemUgMzIgL1ByZXYgMjg3NzYgICAgICAgICAg
ICAgICAgIC9JRCBbPGFiYjQ5MjJhYTY5N2NmZDJiODVjYjY5YjNhZGI4MDZmPjxhYmI0OTIyYWE2
OTdjZmQyYjg1Y2I2OWIzYWRiODA2Zj5dID4+CnN0cmVhbQp4nGNiZOBnYGJgOAkkmPiABKMRiNsG...
}
}

Curl request:

curl -X POST -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) -H "Content-Type: application/json; charset=utf-8" -d @request.json https://us-documentai.googleapis.com/v1beta3/projects/xxxxxxx/locations/us/processors/xxxxxx:process > result.json

Here is a snippet of the output when I used the quickstart file with the Form Parser endpoint:

[image: screenshot of the output]

EDIT: 20210106

I tried fetching the file using GET with your current request in urls, and I got the base64 value cleanly. However, I found an SO post about converting files to base64 that says to

add encoding: null on request options so that you will surely receive a Buffer and not a String

Adding encoding: null worked for me as well. It is worth a shot.
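
To illustrate why receiving a Buffer rather than a String matters, here is a small sketch using made-up bytes (not taken from the actual file). Once binary data has been decoded as UTF-8 text, the invalid byte sequences are replaced with U+FFFD, which is consistent with the `��` characters visible in the question's request data, and encoding that text to base64 no longer reproduces the original file.

```javascript
// Illustration only: "%PDF-" followed by bytes that are not valid UTF-8,
// similar to what a compressed PDF stream contains.
const original = Buffer.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0xff, 0xfe, 0x9c]);

// What happens when an HTTP client decodes the body as text first.
const asText = original.toString('utf8');
const fromText = Buffer.from(asText, 'utf8');

console.log(original.equals(fromText));   // false: invalid bytes became U+FFFD
console.log(original.toString('base64')); // base64 of the real bytes
console.log(fromText.toString('base64')); // base64 of the corrupted bytes
```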

Here is a snippet of my code for GET and encode to base64:

    const request_img = require('request');

    const urls = {
      url: 'https://storage.googleapis.com/cloud-samples-data/documentai/loan_form.pdf',
      method: 'GET',
      // encoding: null makes the library return the body as a Buffer, not a String
      encoding: null,
      headers: {
        'Accept': 'application/pdf',
        'raw': true
      }
    };
    var urlArr = [];
    urlArr.push(urls);

    request_img(urlArr[0], function(err, res, body) {
      if (err) {
        // Surface network errors instead of encoding an undefined body
        console.error(err);
        return;
      }
      var converted_to_base64 = Buffer.from(body).toString('base64');
      console.log(converted_to_base64);
    });

Here is the snippet of the output. I got the file encoded to base64:

JVBERi0xLjUKJb/3ov4KMiAwIG9iago8PCAvTGluZWFyaXplZCAxIC9MIDI5MDUxIC9IIFsgNzkyIDEzNCBdIC9PIDYgL0UgMjg3NzYgL04gMSAvVCAyODc3NSA+PgplbmRvYmoKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAKMyAwIG9iago8PCAvVHlwZSAvWFJlZiAvTGVuZ3RoIDcwIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlIC9EZWNvZGVQYXJtcyA8PCAvQ29sdW1ucyA0IC9QcmVkaWN0b3IgMTIgPj4gL1cgWyAxIDIgMSBdIC9JbmRleCBbIDIgMzAgXSAvSW5mbyAxNyAwIFIgL1Jvb3QgNCAwIFIgL1NpemUgMzIgL1ByZXYgMjg3NzY

By the way, my Node.js version is v10.14.2.

Upvotes: 0
