hubatish

Reputation: 5250

Sample request JSON for Vertex AI endpoint?

I've deployed the Llama 3 model using the Deploy button on the Vertex AI Model Garden Llama 3 card: https://pantheon.corp.google.com/vertex-ai/publishers/meta/model-garden/llama3

I can make a request using the "Try out Llama 3" side panel on that page & it seems to be working with my deployed model + endpoint. I'd like to try making a request with curl or Python next. The endpoint UI page also has a "sample request" feature, but it's much less helpful: very generic rather than customized to the model.

So does anyone have an example request (for this model or another)?

Specifically, I need the JSON instances & parameters. I can probably figure out the parameters, but I have absolutely no idea what an instance is in this context. This seems like the closest related question: Sending http request Google Vertex AI end point

Google Cloud loves naming something generically, giving few details on what it is, & then expecting something very specific as a value.

edit: Found the docs on this GCP method: https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/predict

which gives some description, but "The instances that are the input to the prediction call." is not really that helpful.

Upvotes: 1

Views: 1084

Answers (1)

rhaertel80

Reputation: 8399

Apologies for the poor experience. For now, the best reference is the notebook.

Here's the relevant snippet:

prompt = "What is a car?"  # @param {type: "string"}
max_tokens = 50  # @param {type:"integer"}
temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}
top_k = 1.0  # @param {type:"number"}
raw_response = False  # @param {type:"boolean"}

# Overrides parameters for inference.
# If you hit an error like `ServiceUnavailable: 503 Took too long to respond when processing`,
# reduce the max length, e.g. set max_tokens to 20.
instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response
    }
]
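
If you want to call the deployed endpoint from Python, a minimal sketch using the google-cloud-aiplatform SDK, reusing the instances list from the snippet above, would be (the project, region, and endpoint ID below are placeholders, not values from the notebook):

from google.cloud import aiplatform

# Placeholders: substitute your own project, region, and endpoint ID.
aiplatform.init(project="your-project-id", location="us-central1")
endpoint = aiplatform.Endpoint("your-endpoint-id")

# predict() wraps the list in {"instances": [...]} before sending.
response = endpoint.predict(instances=instances)
print(response.predictions)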

But please note that the full JSON body (e.g. to send using curl) wraps the instances in an "instances" key; with the sample values from the snippet above, it is:

{
  "instances": [
    {
        "prompt": "What is a car?",
        "max_tokens": 50,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1.0,
        "raw_response": false
    }
  ]
}
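
A curl sketch against the standard projects.locations.endpoints.predict REST method (the one from the docs linked in the question) would look like this, where PROJECT_ID, ENDPOINT_ID, and the us-central1 region are placeholders you'd substitute:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict" \
  -d '{"instances": [{"prompt": "What is a car?", "max_tokens": 50, "temperature": 1.0, "top_p": 1.0, "top_k": 1.0, "raw_response": false}]}'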

Upvotes: 2
