Reputation: 622
I am trying to use the unstructured library to convert a Word document into a JSON file. However, it does not see the images: the returned list of elements should include elements of type "Image", but it does not. No error is thrown; the image elements are simply missing. My code and test file are below. The test file contains a string, an image, and another string, yet the image is not detected. What am I doing wrong?
import json
import os

from unstructured.partition.docx import partition_docx

# Set environment variables
os.environ['UNSTRUCTURED_API_KEY'] = "your unstructured.io api key"
os.environ['UNSTRUCTURED_API_URL'] = "https://api.unstructuredapp.io/general/v0/general"

# Partition by filename
elements = partition_docx(filename="input/test.docx")

# Partition again from a file object (this overwrites the result above)
with open("input/test.docx", "rb") as f:
    elements = partition_docx(file=f)

# Convert each element to a plain dict so it can be serialized
elements = [element.to_dict() for element in elements]

# save as json
with open("output/test.json", "w") as f_json:
    json.dump(elements, f_json, indent=2)
My project structure:
root
├── input
└── output
Here's the file: test.docx
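As a sanity check that does not depend on unstructured, note that a .docx file is a ZIP archive and embedded pictures are normally stored under word/media/. The sketch below lists those entries and tallies the element types from the partition output, which makes it easy to see whether the image exists in the file but is missing from the elements. The helper names list_docx_media and count_element_types are made up for illustration, and the tally assumes each dict has the "type" key that element.to_dict() produces.

```python
import zipfile
from collections import Counter


def list_docx_media(docx_path):
    """Return the names of embedded media files inside a .docx (a ZIP archive).

    docx_path may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return [name for name in archive.namelist() if name.startswith("word/media/")]


def count_element_types(element_dicts):
    """Tally the "type" field of the dicts produced by element.to_dict()."""
    return Counter(item.get("type", "unknown") for item in element_dicts)
```

Calling list_docx_media("input/test.docx") should show at least one entry if the picture is really embedded, and count_element_types(elements) on the list of dicts shows at a glance whether any "Image" elements came back.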
Upvotes: 1
Views: 149
Reputation: 622
I didn't find a way to do this with partition_docx, but it did work with unstructured's ingest library (unstructured_ingest). The key is this argument: "extract_image_block_types": ["Image"]. For anyone else struggling with this, give this a try:
import os
import json

# Set environment variables
os.environ['UNSTRUCTURED_API_KEY'] = "your api key"
os.environ['UNSTRUCTURED_API_URL'] = "https://api.unstructuredapp.io/general/v0/general"

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path="input"),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15,
                "extract_image_block_types": ["Image"]
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir="output")
    ).run()
EDIT: I can't get reliable results this way. I'm resorting to converting the document to PDF first and then letting unstructured analyze the PDF.
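The PDF route could look roughly like the sketch below. It is not a tested recipe: it assumes LibreOffice is installed with its soffice binary on the PATH, and that the local unstructured package is installed with its PDF extras. The function names are illustrative, and the partition_pdf arguments should be checked against the installed version.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir):
    """Build a LibreOffice headless command that converts a .docx to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def docx_to_pdf(docx_path, out_dir):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")


def partition_via_pdf(docx_path, out_dir):
    """Convert to PDF, then partition the PDF and ask for Image elements."""
    # Imported lazily so the helpers above work without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    pdf_path = docx_to_pdf(docx_path, out_dir)
    return partition_pdf(
        filename=str(pdf_path),
        strategy="hi_res",
        extract_image_block_types=["Image"],
    )
```

With the layout above, partition_via_pdf("input/test.docx", "output") would write output/test.pdf and return the elements, which can then be dumped to JSON the same way as in the question.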
Upvotes: 1