Reputation: 546
I want to train and test a custom document classifier using Python, and I found this train processor code sample. I started implementing it by following the documentation, but I get an error when I call the function:
train_processor_version_sample(497857003374, 'us','a530739de44a7ca6',"Version-1","gs://documentai-bucket-123/pdfs","gs://documentai-bucket-123/test")
error:
InvalidArgument Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 train_processor_version_sample(497857003374, 'us','a530739de44a7ca6',"Version-1","gs://documentai-bucket-123/pdfs","gs://documentai-bucket-123/test")
Input In [17], in train_processor_version_sample(project_id, location, processor_id, processor_version_display_name, train_data_uri, test_data_uri)
52 print(operation.operation.name)
53 # Wait for operation to complete
---> 54 response = documentai.TrainProcessorVersionResponse(operation.result())
56 metadata = documentai.TrainProcessorVersionMetadata(operation.metadata)
58 print(f"New Processor Version:{response.processor_version}")
File ~/anaconda3/lib/python3.9/site-packages/google/api_core/future/polling.py:261, in PollingFuture.result(self, timeout, retry, polling)
256 self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
258 if self._exception is not None:
259 # pylint: disable=raising-bad-type
260 # Pylint doesn't recognize that this is valid in this case.
--> 261 raise self._exception
263 return self._result
InvalidArgument: 400 Invalid dataset. See operation metadata for specific errors
I have some idea of the cause: the custom document classifier has some training dataset requirements.
Training guidelines
Minimum 2 labels required in the schema
Each label exists on 10 training documents
Each label exists on 2 test documents
I don't know how to get the URL of a labeled dataset, or how to pass the two bucket directories for the training and test sets, using Python code. Can anyone help me with this?
Upvotes: 0
Views: 264
Reputation: 2234
This answer should cover your use case.
The code sample you linked requires the Document JSON files in Google Cloud Storage to be labeled already.
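Once the files are labeled, a quick local check against the training guidelines quoted in the question can catch an undersized dataset before you start a training run. The sketch below is only an illustration: it assumes the Document JSON files have been downloaded and parsed into dicts, and that each document's class label is stored in `entities[].type`, which you should verify against your own exported files. The helper names (`labels_in_document`, `count_labels`, `check_dataset`) are hypothetical and not part of the Document AI client library.

```python
from collections import Counter

# Thresholds taken from the training guidelines quoted in the question.
MIN_LABELS = 2
MIN_TRAIN_DOCS_PER_LABEL = 10
MIN_TEST_DOCS_PER_LABEL = 2


def labels_in_document(document):
    """Return the set of label names in one parsed Document JSON dict.

    Assumption: the class label lives in entities[].type.
    """
    return {e["type"] for e in document.get("entities", []) if e.get("type")}


def count_labels(documents):
    """Count, per label, how many documents carry that label."""
    counts = Counter()
    for document in documents:
        counts.update(labels_in_document(document))
    return counts


def check_dataset(train_docs, test_docs):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    train_counts = count_labels(train_docs)
    test_counts = count_labels(test_docs)
    labels = set(train_counts) | set(test_counts)

    if len(labels) < MIN_LABELS:
        problems.append(
            f"schema has {len(labels)} label(s), need at least {MIN_LABELS}"
        )
    for label in sorted(labels):
        if train_counts[label] < MIN_TRAIN_DOCS_PER_LABEL:
            problems.append(
                f"label '{label}' is on {train_counts[label]} training "
                f"document(s), need at least {MIN_TRAIN_DOCS_PER_LABEL}"
            )
        if test_counts[label] < MIN_TEST_DOCS_PER_LABEL:
            problems.append(
                f"label '{label}' is on {test_counts[label]} test "
                f"document(s), need at least {MIN_TEST_DOCS_PER_LABEL}"
            )
    return problems
```

To use it, load each downloaded file with `json.load` and pass the resulting lists of dicts for the training and test folders to `check_dataset`.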
There is no public API for explicitly labeling documents. The recommended process is to create the labeled data in the Cloud Console; you can then use the training API to trigger the training process.
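If training still fails with "Invalid dataset. See operation metadata for specific errors", the details should be in the metadata of the long-running operation rather than in the exception itself. Here is a minimal sketch of pulling those messages out. It assumes the metadata exposes `training_dataset_validation` and `test_dataset_validation`, each with `dataset_errors` and `document_errors` lists whose items have a `message` field; check these names against the `TrainProcessorVersionMetadata` reference for your library version. `summarize_validation` is a hypothetical helper, and the demo at the bottom uses a stand-in object, not a real API response.

```python
from types import SimpleNamespace


def summarize_validation(metadata):
    """Collect dataset-level and document-level error messages.

    Works on any object with the assumed attribute layout, so it can be
    passed operation.metadata after operation.result() has raised.
    """
    lines = []
    for split in ("training_dataset_validation", "test_dataset_validation"):
        validation = getattr(metadata, split, None)
        if validation is None:
            continue
        for kind in ("dataset_errors", "document_errors"):
            for status in getattr(validation, kind, []):
                lines.append(f"{split}.{kind}: {status.message}")
    return lines


# Stand-in metadata for illustration only (not real service output).
demo_metadata = SimpleNamespace(
    training_dataset_validation=SimpleNamespace(
        dataset_errors=[SimpleNamespace(message="example dataset error")],
        document_errors=[],
    ),
    test_dataset_validation=SimpleNamespace(
        dataset_errors=[],
        document_errors=[SimpleNamespace(message="example document error")],
    ),
)

for line in summarize_validation(demo_metadata):
    print(line)
```

In the function from the question, you would wrap the `operation.result()` call in `try`/`except InvalidArgument` (from `google.api_core.exceptions`) and print `summarize_validation(operation.metadata)` in the handler.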
Upvotes: 0