Reputation: 1
I am encountering a problem using the ChatGPT API to detect objects in an image, and I am not sure whether the API is capable of doing so. I checked the OpenAI docs, and it seems like it should be able to: https://platform.openai.com/docs/guides/vision?lang=node
This is the response I get from the API:
console.log response from API: I'm sorry but as an AI text-based model, I don't have the capability to see or analyze images. Could you provide a textual description of the items?
Could someone clarify this for me?
I expected to get a list of item names and their quantities to use with the Edamam API.
This is the code:
// openAI.test.ts
import env from "dotenv";
// import path from "path";
import { fetchChatCompletion, Message } from "../../services/openAI";

env.config();

// const filePath = path.join(__dirname, "../public/images.jpeg");
const imageURL =
  "https://www.diabetesfoodhub.org/system/user_files/Images/1837-diabetic-pecan-crusted-chicken-breast_JulAug20DF_clean-simple_061720.jpg";
// console.log("imageURL", filePath);

const message: Message[] = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: "Hello, could you please provide the name of each item and the quantity of each item in the image, so I can use it with the Edamam API?",
      },
      {
        type: "image_url",
        image_url: { url: imageURL },
      },
    ],
  },
];

describe("fetchChatCompletion", () => {
  test("fetches chat completion successfully", async () => {
    try {
      const response = await fetchChatCompletion(message);
      expect(response).toBeDefined();
      console.log("response from API: \n", response);
    } catch (error) {
      console.error("API Error: ", error.message);
    }
  }, 30000); // Increase timeout to 30 seconds
});
// openAI.ts
import OpenAI from "openai";
import dotenv from "dotenv";
// import readline from "readline";

dotenv.config();

export interface Message {
  role: "user" | "assistant";
  content: [
    { type: "text"; text: string },
    { type: "image_url"; image_url: object }
  ];
}

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function fetchChatCompletion(messages: Message[]): Promise<string> {
  console.log("messages: \n", messages);
  const imageUrl = messages[0].content[1].image_url;
  console.log("imageUrl: ", imageUrl);
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4", // Ensure the model name is correctly specified
      messages: messages.map((msg) => ({
        role: msg.role,
        content: msg.content
          .map((contentItem) => {
            if (contentItem.type === "text") {
              return contentItem.text;
            } else if (contentItem.type === "image_url") {
              // Handle image URL as text, since ChatGPT can't process images
              return `Image URL: ${contentItem.image_url}`;
            }
            return ""; // Fallback for unknown content types
          })
          .join(" "), // Combine text and image URL descriptions into a single string
      })),
    });
    const latestResponse = response.choices[0].message.content;
    // console.log("latestResponse", latestResponse);
    return latestResponse;
  } catch (error) {
    console.error("API Error: ", error.message);
    return error.message;
  }
}

export { fetchChatCompletion };
Upvotes: 0
Views: 439
Reputation: 25009
The format of the content that you are passing:
content: msg.content
  .map((contentItem) => {
    if (contentItem.type === "text") {
      return contentItem.text;
    } else if (contentItem.type === "image_url") {
      // Handle image URL as text, since ChatGPT can't process images
      return `Image URL: ${contentItem.image_url}`;
    }
    return ""; // Fallback for unknown content types
  })
  .join(" "), // Combine text and image URL descriptions into a single string
Is different from the one in the documentation:
content: [
  { type: "text", text: "What’s in this image?" },
  {
    type: "image_url",
    image_url: {
      "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
    },
  },
],
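Concretely, instead of mapping the parts and joining them into a single string, the content array can be forwarded to the API untouched. As a sketch against the code above (exact typings depend on the installed openai SDK version):

messages: messages.map((msg) => ({
  role: msg.role,
  // Keep the array of text/image_url parts so the image actually reaches the model.
  content: msg.content,
})),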
The model is also different (gpt-4 vs gpt-4o). The docs mention the following:
Both GPT-4o and GPT-4 Turbo have vision capabilities, meaning the models can take in images and answer questions about them. Historically, language model systems have been limited by taking in a single input modality, text.
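Putting both changes together, a minimal sketch of the request (using the imageURL from the question; the prompt text here is only illustrative, and the typings should be checked against your installed openai package version):

const response = await openai.chat.completions.create({
  model: "gpt-4o", // vision-capable model
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What items are in this image, and how many of each?" },
        { type: "image_url", image_url: { url: imageURL } },
      ],
    },
  ],
});
console.log(response.choices[0].message.content);

With the array form, the image_url part is sent to the model instead of being stringified into the prompt, and a vision-capable model such as gpt-4o can then describe the items in the photo.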
Upvotes: 0