Reputation: 49
I just deployed the Nous-Hermes-Llama2-70b model on 2x Nvidia A100 GPUs through Hugging Face Inference Endpoints.
When I tried the following code, the generated responses were incomplete sentences less than one line long.
import requests

API_URL = 'https://myendpoint.us-east-1.aws.endpoints.huggingface.cloud'
headers = {
    "Authorization": "Bearer mytoken1234",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "### Instruction:\r\nCome up with a joke about cats\r\n### Response:\r\n",
})
The output in this case was:
"Why don't cats play poker in the jungle?
Because "
As you can see, the response stops after nine words, mid-sentence.
Do I need to add more options to the request, such as temperature and a maximum token length? How would I do that? What do I need to do to get complete, longer responses?
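From reading around, I suspect generation options go in a `parameters` object inside the JSON payload rather than in the HTTP headers, but I haven't confirmed this works on my endpoint. Here is a sketch of the payload shape I think is expected (the parameter names and values are my assumption from the text-generation docs):

```python
# Sketch of the payload I believe the endpoint expects: generation
# options sit in a "parameters" object next to "inputs", not in the
# HTTP headers. Names/values below are my guesses, untested.
payload = {
    "inputs": "### Instruction:\r\nCome up with a joke about cats\r\n### Response:\r\n",
    "parameters": {
        "max_new_tokens": 512,   # raise the cap on generated tokens
        "temperature": 0.7,      # sampling temperature
    },
}

# I would then pass this dict to the same query() function as before,
# e.g. output = query(payload)
print(payload["parameters"]["max_new_tokens"])
```

Is this the right structure, or does the endpoint expect these settings somewhere else?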
Here is the model I'm using: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-70b
Upvotes: 0
Views: 956