Reputation: 520
I need to extract pairs of codes and descriptions from a table that has columns but no row separators, in an image like this:
I tried Gemini 1.5 Flash: I provided the image and the corresponding prompt in the chat, and it managed to extract the code-description pairs surprisingly well:
When I tried to write a Python program that does the same, I only found documentation for extracting the text from the image and then passing that text to the LLM so it can pair the codes (under the column "CÓDIGO") with the descriptions (under "DESIGNACIÓN DE LA MERCANCÍA"). But since the extracted text has lost the table layout, it is impossible for the model to work out the pairs from the text alone:
import io
import os

import vertexai
from google.cloud import vision
from vertexai.generative_models import GenerativeModel

service_account_path = "my_key.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path

# Authenticate with the service account key and initialise Vertex AI
vertexai.init(project="my_project", location="us-central1")


def process_image(image_path):
    # OCR: extract the raw text from the image with the Vision API
    vision_client = vision.ImageAnnotatorClient()
    with io.open(image_path, 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    ocr_response = vision_client.document_text_detection(image=image)
    text = ocr_response.full_text_annotation.text

    prompt = f"""
    The following text is extracted from a table in Spanish.
    The table has columns: "CÓDIGO", "DESIGNACIÓN DE LA MERCANCÍA".
    The "DESIGNACIÓN DE LA MERCANCÍA" column often starts with a hyphen "-".
    Extract the data from the text and present it as a list of dictionaries,
    where each dictionary has the following structure:
    {{"CÓDIGO": "code_value", "DESIGNACIÓN DE LA MERCANCÍA": "description"}}

    Text:
    {text}
    """

    # Pass the OCR text (which no longer carries the table layout) to the model
    model = GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(prompt)
    return response.text
Is there a way to get this or another GenAI to receive an image and process the prompt to extract the data the same way the chat version of Gemini 1.5 Flash did?
Upvotes: 0
Views: 180
Reputation: 520
It turned out that I was not using the most recent libraries for interacting with Gemini: I should be using "google-generativeai" instead of the "google-cloud-*" packages.
Since Gemini 1.5 Flash is multimodal, you can load images and pass them directly as part of the request to the model, like this:
import os

import PIL.Image
import google.generativeai as genai

# Read the API key from a local file and expose it to the SDK
with open('GEMINI_API_KEY.txt', 'r') as file:
    os.environ['GEMINI_API_KEY'] = file.read().strip()

genai.configure(api_key=os.environ['GEMINI_API_KEY'])

image_path = '/path/images'
image_path_1 = f"{image_path}/01.png"
image_path_2 = f"{image_path}/02.png"

sample_file_1 = PIL.Image.open(image_path_1)
sample_file_2 = PIL.Image.open(image_path_2)

# Choose a Gemini model
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

prompt = "..."

# Pass the prompt and the images in a single request
response = model.generate_content([prompt, sample_file_1, sample_file_2])
print(response.text)
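If you want the result directly as structured data instead of free text, one option is to ask Gemini 1.5 for JSON output and parse the reply. This is only a minimal sketch, assuming the same configuration as above; the image paths and prompt wording are placeholders, and it relies on the model following the requested key names:

import json

import PIL.Image
import google.generativeai as genai

# Assumes genai.configure(...) has already been called as in the snippet above.
# The image paths below are placeholders for your own table screenshots.
page_1 = PIL.Image.open("/path/images/01.png")
page_2 = PIL.Image.open("/path/images/02.png")

extraction_prompt = (
    'The images show a Spanish table with the columns "CÓDIGO" and '
    '"DESIGNACIÓN DE LA MERCANCÍA". Return a JSON array of objects, one per row, '
    'each with the keys "CÓDIGO" and "DESIGNACIÓN DE LA MERCANCÍA".'
)

# response_mime_type asks Gemini 1.5 to reply with JSON instead of free text
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)

response = model.generate_content([extraction_prompt, page_1, page_2])
rows = json.loads(response.text)  # list of {"CÓDIGO": ..., "DESIGNACIÓN DE LA MERCANCÍA": ...}

for row in rows:
    print(row["CÓDIGO"], "->", row["DESIGNACIÓN DE LA MERCANCÍA"])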
Upvotes: 0