Reputation: 520
I need to extract pairs of codes and descriptions from a table that has columns but no row separators, in an image like this:
I tried Gemini 1.5 Flash: I provided the image and the corresponding prompt in the chat, and it managed to extract the code-description pairs surprisingly well:
When I tried to write a Python program that does the same, I only found documentation for extracting the text from the image and then passing that text to the LLM so it can pair the codes (under the column "CÓDIGO") with the descriptions (under "DESIGNACIÓN DE LA MERCANCÍA"). But since the extracted text has lost the table layout, it is impossible for the model to work out the pairs from the text alone:
import io
import os

import vertexai
from google.cloud import vision
from vertexai.generative_models import GenerativeModel

service_account_path = "my_key.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path

# Authenticate with the service account key and initialise Vertex AI
vertexai.init(project="my_project", location="us-central1")


def process_image(image_path):
    # OCR: extract the raw text from the image with the Vision API
    vision_client = vision.ImageAnnotatorClient()
    with io.open(image_path, 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    ocr_response = vision_client.document_text_detection(image=image)
    text = ocr_response.full_text_annotation.text

    prompt = f"""
    The following text is extracted from a table in Spanish.
    The table has columns: "CÓDIGO", "DESIGNACIÓN DE LA MERCANCÍA".
    The "DESIGNACIÓN DE LA MERCANCÍA" column often starts with a hyphen "-".
    Extract the data from the text and present it as a list of dictionaries,
    where each dictionary has the following structure:
    {{"CÓDIGO": "code_value", "DESIGNACIÓN DE LA MERCANCÍA": "description"}}

    Text:
    {text}
    """

    # Pass the OCR text (which no longer carries the table layout) to the model
    model = GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(prompt)
    return response.text
Is there a way to get this or another GenAI to receive an image and process the prompt to extract the data the same way the chat version of Gemini 1.5 Flash did?
Upvotes: 0
Views: 180
Reputation: 520
It turned out that I was not using the most recent libraries for interacting with Gemini: I should be using "google-generativeai" instead of the "google-cloud-*" packages.
Since Gemini 1.5 Flash is multimodal, you can load images and pass them directly as part of the request to the model, like this:
import os

import PIL.Image
import google.generativeai as genai

# Read the API key from a local file and expose it to the SDK
with open('GEMINI_API_KEY.txt', 'r') as file:
    os.environ['GEMINI_API_KEY'] = file.read().strip()

genai.configure(api_key=os.environ['GEMINI_API_KEY'])

image_path = '/path/images'
image_path_1 = f"{image_path}/01.png"
image_path_2 = f"{image_path}/02.png"

sample_file_1 = PIL.Image.open(image_path_1)
sample_file_2 = PIL.Image.open(image_path_2)

# Choose a Gemini model
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

prompt = "..."

# Pass the prompt and the images in a single request
response = model.generate_content([prompt, sample_file_1, sample_file_2])
print(response.text)
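If you want the result directly as structured data instead of free text, one option is to ask Gemini 1.5 for JSON output and parse the reply. This is only a minimal sketch, assuming the same configuration as above; the image paths and prompt wording are placeholders, and it relies on the model following the requested key names:

import json

import PIL.Image
import google.generativeai as genai

# Assumes genai.configure(...) has already been called as in the snippet above.
# The image paths below are placeholders for your own table screenshots.
page_1 = PIL.Image.open("/path/images/01.png")
page_2 = PIL.Image.open("/path/images/02.png")

extraction_prompt = (
    'The images show a Spanish table with the columns "CÓDIGO" and '
    '"DESIGNACIÓN DE LA MERCANCÍA". Return a JSON array of objects, one per row, '
    'each with the keys "CÓDIGO" and "DESIGNACIÓN DE LA MERCANCÍA".'
)

# response_mime_type asks Gemini 1.5 to reply with JSON instead of free text
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)

response = model.generate_content([extraction_prompt, page_1, page_2])
rows = json.loads(response.text)  # list of {"CÓDIGO": ..., "DESIGNACIÓN DE LA MERCANCÍA": ...}

for row in rows:
    print(row["CÓDIGO"], "->", row["DESIGNACIÓN DE LA MERCANCÍA"])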
Upvotes: 0