Lucas Azevedo

Reputation: 2370

How to use a Llama model with langchain? It gives an error: Pipeline cannot infer suitable model classes from: <model_name> - HuggingFace

I fine-tuned a model (https://huggingface.co/decapoda-research/llama-7b-hf) using peft and LoRA and saved it as https://huggingface.co/lucas0/empath-llama-7b. Now I'm getting Pipeline cannot infer suitable model classes from when trying to use it along with langchain and a Chroma vector DB:

from langchain.embeddings import HuggingFaceHubEmbeddings
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma

repo_id = "sentence-transformers/all-mpnet-base-v2"
embedder = HuggingFaceHubEmbeddings(
    repo_id=repo_id,
    task="feature-extraction",
    huggingfacehub_api_token="XXXXX",
)
comments = ["foo", "bar"]
embeddings = embedder.embed_documents(texts=comments)
docsearch = Chroma.from_texts(comments, embedder).as_retriever()
#docsearch = Chroma.from_documents(texts, embeddings)

llm = HuggingFaceHub(repo_id='lucas0/empath-llama-7b', huggingfacehub_api_token='XXXXX')
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch, return_source_documents=False)

q = input("input your query:")
result = qa.run(query=q)

print(result["result"])

Is anyone able to tell me how to fix this? Is it an issue with the model card? I was facing issues with the lack of a config.json file and ended up just placing the same config.json as the model I used as the base for the LoRA fine-tuning. Could that be the origin of the issue? If so, how do I generate the correct config.json without having to get the original llama weights?

Also, is there a way of loading several sentences into a custom HF model (not only OpenAI, as the tutorials show) without using vector DBs?

Thanks!


The same issue happens when trying to run the API on the model's HF page:

[screenshot: the same error shown by the Hosted Inference API widget on the model page]

Upvotes: 2

Views: 12883

Answers (1)

alvas

Reputation: 122218

Before using the langchain API with the huggingface model, you should first try to load the model in Huggingface transformers:

from transformers import AutoModel

model = AutoModel.from_pretrained('lucas0/empath-llama-7b')

And that'll throw some errors:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-1b9ce76f5421> in <cell line: 3>()
      1 from transformers import AutoModel
      2 
----> 3 model = AutoModel.from_pretrained('lucas0/empath-llama-7b')

1 frames
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2553                             )
   2554                         else:
-> 2555                             raise EnvironmentError(
   2556                                 f"{pretrained_model_name_or_path} does not appear to have a file named"
   2557                                 f" {_add_variant(WEIGHTS_NAME, variant)}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME} or"

OSError: lucas0/empath-llama-7b does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

Then looking into the model files, it looks like only the adapter model is saved and not the full model, https://huggingface.co/lucas0/empath-llama-7b/tree/main, so AutoModel is throwing tantrums.
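
You can confirm this from code; a quick check of the repo contents with huggingface_hub (assuming the repo still looks the way it does at the time of writing) lists only the adapter files:

from huggingface_hub import list_repo_files

# Lists the files in the repo; this shows adapter_config.json / adapter_model.bin
# style files, but no pytorch_model.bin or full-model config that AutoModel needs.
print(list_repo_files("lucas0/empath-llama-7b"))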

To load an adapted model, you have to load the base model and the peft/adapter model separately. First the installs (restart the runtime after installing, if needed):

! pip install -U peft accelerate
! pip install -U sentencepiece
! pip install -U transformers

Then to load the model, take a look at the guanaco example from Trying to install guanaco (pip install guanaco) for a text classification model but getting error (you will need a GPU runtime):

import torch
from peft import PeftModel    
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

model_name = "decapoda-research/llama-7b-hf"
adapters_name = 'lucas0/empath-llama-7b'

print(f"Starting to load the model {model_name} into memory")

# Load the base llama model in bfloat16 on GPU 0 (uncomment load_in_4bit for 4-bit loading).
m = AutoModelForCausalLM.from_pretrained(
    model_name,
    #load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
# Attach the LoRA adapter weights, then merge them into the base model.
m = PeftModel.from_pretrained(m, adapters_name)
m = m.merge_and_unload()
tok = LlamaTokenizer.from_pretrained(model_name)
tok.bos_token_id = 1

stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")

Now that you can load the model you've adapted/fine-tuned in Huggingface transformers, you can try it with langchain. Before that, we have to dig into the langchain code. To use a prompt with an HF model, users are told to do this:

from langchain import PromptTemplate, LLMChain, HuggingFaceHub

template = """ Hey llama, you like to eat quinoa. Whatever question I ask you, you reply with "Waffles, waffles, waffles!".
 Question: {input} Answer: """
prompt = PromptTemplate(template=template, input_variables=["input"])


model = HuggingFaceHub(repo_id="facebook/mbart-large-50",
                       model_kwargs={"temperature": 0, "max_length": 200})
chain = LLMChain(prompt=prompt, llm=model)

But when we look at the HuggingFaceHub object, it isn't just a vanilla AutoModel from Huggingface transformers.

When we look at https://github.com/hwchase17/langchain/blob/master/langchain/chains/llm.py, we see that it's trying to load the llm=... argument with some wrapper class, so we dig deeper into langchain's HuggingFaceHub object at https://github.com/hwchase17/langchain/blob/master/langchain/llms/huggingface_hub.py

The HuggingFaceHub object wraps over the huggingface_hub.inference_api.InferenceApi for the text-generation, text2text-generation or summarization tasks.

And HuggingFaceHub looks like some spaghetti-like object that inherits from the LLM object, https://github.com/hwchase17/langchain/blob/master/langchain/llms/base.py#L453
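
In other words, when langchain calls your HuggingFaceHub llm, under the hood it is roughly doing something like the following (a simplified sketch of the hosted Inference API call, not the actual langchain code; the repo id and token here are placeholders):

from huggingface_hub.inference_api import InferenceApi

# Roughly what langchain's HuggingFaceHub wraps: a call to the hosted Inference API.
inference = InferenceApi(repo_id="facebook/mbart-large-50", token="XXXXX")
print(inference(inputs="Who is Princess Momo?"))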

To summarize this a little, we want to:

  • load a HuggingFaceHub with langchain API,
  • and the HuggingFaceHub is actually a wrapper over the huggingface_hub.inference_api.InferenceApi
  • and the HuggingFaceHub object is a subclass of llm.base.LLM

Given that knowledge on the HuggingFaceHub object, now, we have several options:

Opinion: The easiest way around it is to avoid langchain entirely; since it's a wrapper around other tools, you can write your own customized wrapper that skips the levels of inheritance langchain creates to wrap around as many tools as it can/needs (see the short sketch after these options).

Ideally: Ask the langchain developers/maintainers to support loading peft/adapter models and write another subclass for them.

Practical: Let's hack the thing and write our own LLM subclass.
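
For completeness, the first (langchain-free) option could look roughly like this; it's a sketch that reuses the merged model m and tokenizer tok from above and stuffs the texts you would otherwise index in Chroma straight into the prompt:

from transformers import pipeline

# No langchain: build the "stuffed" prompt by hand and call the pipeline directly.
generator = pipeline("text-generation", model=m, tokenizer=tok)

comments = ["foo", "bar"]  # the texts from the question
context = "\n".join(comments)
query = "Who is Princess Momo?"
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(generator(prompt, max_length=200)[0]["generated_text"])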


Practical solution:

Let's try to hack up a new LLM subclass:

from typing import Any, Dict, List, Mapping, Optional

from pydantic import Extra, root_validator

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from langchain.llms.utils import enforce_stop_tokens

from langchain import PromptTemplate, LLMChain
from transformers import pipeline

class HuggingFaceHugs(LLM):
  pipeline: Any
  class Config:
    """Configuration for this pydantic object."""
    extra = Extra.forbid

  def __init__(self, model, tokenizer, task="text-generation"):
    super().__init__()
    self.pipeline = pipeline(task, model=model, tokenizer=tokenizer)

  @property
  def _llm_type(self) -> str:
    """Return type of llm."""
    return "huggingface_hub"

  def _call(self, prompt, stop: Optional[List[str]] = None, run_manager: Optional[CallbackManagerForLLMRun] = None,):
    # Run the inference.
    text = self.pipeline(prompt, max_length=100)[0]['generated_text']
    
    # @alvas: I've totally no idea what this in langchain does, so I copied it verbatim.
    if stop is not None:
      # This is a bit hacky, but I can't figure out a better way to enforce
      # stop tokens when making calls to huggingface_hub.
      text = enforce_stop_tokens(text, stop)
    print(text)
    return text[len(prompt):]


template = """ Hey llama, you like to eat quinoa. Whatever question I ask you, you reply with "Waffles, waffles, waffles!".
 Question: {input} Answer: """
prompt = PromptTemplate(template=template, input_variables=["input"])


hf_model = HuggingFaceHugs(model=m, tokenizer=tok)

chain = LLMChain(prompt=prompt, llm=hf_model)

chain("Who is Princess Momo?")

Phew, langchain didn't complain... and here's the output:

{'input': 'Who is Princess Momo?',
 'text': ' She is a princess.  She is a princess.  She is a princess.  She is a princess.  She is a princess.  She is a princess.  She is a princess.  She is'}

Epilogue: Apparently this llama model doesn't understand that all it needs to do is to reply "Waffles, waffles, waffles!".


TL;DR

See https://colab.research.google.com/drive/1l2GiSSPbajVyp2Nk3CFT4t3uH6-5TiBe?usp=sharing

Upvotes: 14
