ThaNoob

Reputation: 622

unstructured cannot find images

I am trying to use the unstructured library to convert a Word document into a JSON file. However, it does not detect the images: the returned list of elements should contain elements of type "Image", but it does not. No error is thrown; the image elements are simply missing. My code and test file are below. The test file contains a string, an image, and another string, but the image is not detected. What am I doing wrong?

import json
import os

from unstructured.partition.docx import partition_docx

# Set environment variables
os.environ['UNSTRUCTURED_API_KEY'] = "your unstructured.io api key"
os.environ['UNSTRUCTURED_API_URL'] = "https://api.unstructuredapp.io/general/v0/general"

# Partition the document and convert each element to a plain dict
with open("input/test.docx", "rb") as f:
    elements = partition_docx(file=f)
    elements = [element.to_dict() for element in elements]

# Save as JSON
with open("output/test.json", "w") as f_json:
    json.dump(elements, f_json, indent=2)

My project structure:

root
├── input
└── output

Here's the file: test.docx
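
Before blaming the parser, it can help to confirm that the picture is really embedded in the file. A .docx is a ZIP archive, and embedded pictures are normally stored under word/media/ (linked pictures are not). The following is a minimal sketch using only the standard library; the helper name list_embedded_media is hypothetical and not part of unstructured.

```python
import zipfile


def list_embedded_media(docx):
    """Return the archive names of all files stored under word/media/.

    `docx` may be a path or a binary file-like object (hypothetical helper,
    not part of the unstructured library).
    """
    with zipfile.ZipFile(docx) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        ]


if __name__ == "__main__":
    import os

    path = "input/test.docx"
    if os.path.exists(path):
        print(list_embedded_media(path))
```

If this prints an empty list, the image is not embedded in the document, so no partitioner could return it.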

Upvotes: 1

Views: 149

Answers (1)

ThaNoob

Reputation: 622

I didn't find a way to do this with partition_docx, but it did work with the ingest library from unstructured (unstructured_ingest). The key is this argument: "extract_image_block_types": ["Image"]. For anyone else struggling with this, give this a try:

import os
import json

# Set environment variables
os.environ['UNSTRUCTURED_API_KEY'] = "your api key"
os.environ['UNSTRUCTURED_API_URL'] = "https://api.unstructuredapp.io/general/v0/general"

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig


if __name__ == "__main__":

    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path="input"),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15,
                "extract_image_block_types": ["Image"]
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir="output")
    ).run()
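
To check whether the Image elements actually made it into the output, the JSON files written to output can be inspected afterwards. This is a rough sketch, not part of the pipeline itself. It assumes each element is a dict with a "type" key, and that extracted images arrive as base64 in metadata["image_base64"] with an optional metadata["image_mime_type"]; the helper names and the output/images folder are made up for illustration.

```python
import base64
import json
from collections import Counter
from pathlib import Path


def summarize_elements(elements):
    """Count elements by their "type" field."""
    return Counter(el.get("type", "Unknown") for el in elements)


def save_embedded_images(elements, out_dir):
    """Decode base64 image payloads of Image elements and write them to out_dir.

    Assumption: the payload lives in metadata["image_base64"] and the mime
    type, if present, in metadata["image_mime_type"].
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, el in enumerate(elements):
        metadata = el.get("metadata", {})
        payload = metadata.get("image_base64")
        if el.get("type") != "Image" or not payload:
            continue
        # Fall back to JPEG when no mime type is given (an assumption).
        mime = metadata.get("image_mime_type", "image/jpeg")
        target = out_dir / f"image_{index}.{mime.split('/')[-1]}"
        target.write_bytes(base64.b64decode(payload))
        written.append(target)
    return written


if __name__ == "__main__":
    # Walk every JSON file the pipeline wrote under output/.
    for json_file in Path("output").rglob("*.json"):
        elements = json.loads(json_file.read_text(encoding="utf-8"))
        print(json_file.name, dict(summarize_elements(elements)))
        save_embedded_images(elements, "output/images")
```

If the printed counts contain no "Image" entry, the partitioner did not return any image elements for that file.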

EDIT: I can't get reliable results with this approach. I'm resorting to converting the document to PDF first and then letting unstructured analyze the PDF.
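
One possible way to do the conversion step is LibreOffice in headless mode. This is only a sketch under that assumption: it requires the soffice executable on the PATH, and the function names and the input_pdf folder are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one document to PDF."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_docx_to_pdf(docx_path, out_dir, soffice="soffice"):
    """Convert docx_path to PDF inside out_dir and return the PDF path."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(docx_path, out_dir, soffice), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    pdf_path = out_dir / (Path(docx_path).stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"conversion produced no file at {pdf_path}")
    return pdf_path


if __name__ == "__main__":
    if Path("input/test.docx").exists() and shutil.which("soffice"):
        print(convert_docx_to_pdf("input/test.docx", "input_pdf"))
```

The resulting folder of PDFs can then be used as the input_path of the pipeline in place of the folder with the Word files.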

Upvotes: 1
