Jimmison Johnson

Reputation: 49

Llama 70b on Hugging Face Inference API Endpoint short responses

I just deployed the Nous-Hermes-Llama2-70b model (70B parameters) on 2x Nvidia A100 GPUs through Hugging Face Inference Endpoints.

When I ran the following code, the generated responses were incomplete sentences less than one line long.

import requests

API_URL = "https://myendpoint.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
  "Authorization": "Bearer mytoken1234",
  "Content-Type": "application/json"
}

def query(payload):
  response = requests.post(API_URL, headers=headers, json=payload)
  return response.json()

output = query({
  "inputs": "### Instruction:\r\nCome up with a joke about cats\r\n### Response:\r\n",
})

The output in this case was:

"Why don't cats play poker in the jungle?

 Because "

As you can see, the response stopped mid-sentence after just nine words.

Do I need to add more parameters to the request, such as temperature and maximum token length? How would I do that? What do I need to do to get complete, full-length responses?

Here is the model I'm using: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-70b

Upvotes: 0

Views: 956

Answers (1)

Jimmison Johnson

Reputation: 49

Added "max_new_tokens" => 256 as a parameter, fixed it.

Upvotes: 0
