Vardhan Gupta

Reputation: 145

langchain: how to use a custom embedding model locally?

I am trying to use a custom embedding model in Langchain with chromaDB. I can't seem to find a way to use the base embedding class without having to use some other provider (like OpenAIEmbeddings or HuggingFaceEmbeddings). Am I missing something?

On the LangChain page it says that the base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. So I figured there must be a way to create another class on top of this class and override/implement those methods with our own methods. But how do I do that?

I tried to somehow use the base embeddings class but am unable to create a new embedding object/class on top of it.

Upvotes: 13

Views: 35237

Answers (8)

Jon M

Reputation: 674

In order to use embeddings with something like LangChain, your class needs to provide the embed_documents and embed_query methods. Otherwise, routines such as Chroma.from_documents have nothing to call when they embed your texts.

Like so...

from typing import List

from langchain_community.vectorstores import Chroma
from sentence_transformers import SentenceTransformer

class MyEmbeddings:
    def __init__(self, model):
        self.model = SentenceTransformer(model, trust_remote_code=True)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode(t).tolist() for t in texts]

    def embed_query(self, query: str) -> List[float]:
        # Encode a single string so the result is a flat List[float], not a batch.
        return self.model.encode(query).tolist()

#...

embeddings = MyEmbeddings('your model name')  # e.g. "sentence-transformers/all-MiniLM-L6-v2"

chromadb = Chroma.from_documents(
    documents=your_docs,
    embedding=embeddings,
)
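
To sanity-check the store, you can then run a similarity search against it (the query string here is just an example):

# similarity_search embeds the query via embed_query and returns Documents.
results = chromadb.similarity_search("your search query", k=3)
for doc in results:
    print(doc.page_content)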

Upvotes: 11

Mohssine SERRAJI

Reputation: 21

To use a custom embedding model locally in LangChain, you can create a subclass of the Embeddings base class and implement the embed_documents and embed_query methods using your preferred embedding model. Below, I'll show you how to use a local embedding model with LangChain using the SentenceTransformer library.

from langchain_core.embeddings import Embeddings
from sentence_transformers import SentenceTransformer
from typing import List

class MyEmbeddings(Embeddings):
    def __init__(self, model):
        self.model = SentenceTransformer(model, trust_remote_code=True)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode(t).tolist() for t in texts]
    
    def embed_query(self, query: str) -> List[float]:
        # Encode a single string so the result is a 1-D vector, not a batch.
        return self.model.encode(query).tolist()

    # The Embeddings base class already provides async wrappers
    # (aembed_documents / aembed_query) that run the sync methods in an
    # executor, so there is no need to override them here.

Use it with Chroma:

from langchain_community.vectorstores import Chroma

embedding_function = MyEmbeddings("your model name")

vector_store = Chroma(
    "faq_collection",
    embedding_function=embedding_function,
    persist_directory="./faq_persist",
    collection_metadata={"hnsw:space": "cosine"},
)
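
A quick usage sketch (the sample texts and query below are placeholders):

vector_store.add_texts(["How do I reset my password?", "Where can I find my invoice?"])
hits = vector_store.similarity_search("password reset", k=1)
print(hits[0].page_content)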

Upvotes: 0

張育嘉

Reputation: 1

I had just tried to do this and finally managed it after reading all the answers to this question.

Before my explanation and code, I must say that the Chroma documentation is quite unhelpful! It did not help me use my local fine-tuned model with Chroma.from_documents() to build a vector database.

Here is the explanation.

The code from Jon M is essentially correct. The two details you must get right are that the input of self.model.encode() in embed_query() should be a single string rather than a list of strings, and that the output of self.model.encode() is a numpy array rather than a list of floats, so it needs .tolist(). The full code should be something like this:

import os
from typing import List

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from sentence_transformers import SentenceTransformer

class MyEmbedding:
    def __init__(self, model):
        self.model = SentenceTransformer(model, trust_remote_code=True)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode(text).tolist() for text in texts]

    def embed_query(self, query: str) -> List[float]:
        encoded_query = self.model.encode(query)
        return encoded_query.tolist()

database_path = 'your_vectorDB_path'

# Input your sentences in List[str] type and relative path of your own local model.
def set_vector_db(texts, model_path):
    text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=40)

    chunks = text_splitter.create_documents(texts)

    embedding_model = MyEmbedding(model_path)

    os.makedirs(database_path, exist_ok=True)

    chromadb = Chroma.from_documents(chunks, 
                                     embedding=embedding_model,
                                     collection_name='coll_cosine',
                                     collection_metadata={"hnsw:space": "cosine"},
                                     persist_directory=database_path)
    chromadb.persist()

# Input your question in string type and the relative path of your own local model.
def retrieve(user_query, model_path):
    embedding_model = MyEmbedding(model_path)

    chromadb = Chroma(embedding_function=embedding_model,
                      collection_name='coll_cosine',
                      collection_metadata={"hnsw:space": "cosine"},
                      persist_directory=database_path)

    results = chromadb.similarity_search_with_score(user_query, 10)

    return results[0][0].page_content

The main point is that the return of embed_query() should be self.model.encode(query).tolist().
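
To see the type issue concretely, here is a quick check (using a public model name as a stand-in for your local fine-tuned model):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

single = model.encode("hello")     # single string -> 1-D numpy array
print(type(single))                # <class 'numpy.ndarray'>
print(type(single.tolist()))       # <class 'list'>, what LangChain expects

batch = model.encode(["hello"])    # list of strings -> 2-D array, one row per text
print(batch.shape)                 # (1, embedding_dim)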

Upvotes: 0

paarandika

Reputation: 1439

You can create your own class and implement methods such as embed_documents. If you want to adhere strictly to the typing, you can extend the Embeddings class (from langchain_core.embeddings.embeddings import Embeddings) and implement the abstract methods there. You can find the class implementation in the langchain_core source.

Below is a small working custom embedding class I used with semantic chunking.

from sentence_transformers import SentenceTransformer
from langchain_experimental.text_splitter import SemanticChunker
from typing import List


class MyEmbeddings:
    def __init__(self):
        self.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode(t).tolist() for t in texts]


embeddings = MyEmbeddings()

splitter = SemanticChunker(embeddings)
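
A quick usage sketch (the sample text is a placeholder):

# SemanticChunker only calls embed_documents, which is why the class above suffices.
text = "LangChain is a framework for building LLM apps. Chroma is a vector store."
docs = splitter.create_documents([text])
for d in docs:
    print(d.page_content)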

Upvotes: 7

Sam Marvasti

Reputation: 128

When you initially run HuggingFaceEmbeddings, it downloads the model. Subsequent runs don't require an internet connection and run locally depending on the model. An excellent illustration of this is the privateGPT project or this modified version, which allows you to utilize AzureOpenAI.

You can find further information on this at: GitHub - MarvsaiDev/privateGPT. You can access AzureOpenAI for free by setting up an Azure account. gtr-t5-large runs locally. The model BAAI/bge-large-en-v1.5 also runs locally but requires a GPU.

A few questions and notes:

Have you worked with Python before? I won't give a full rundown here, but LangChain uses builder patterns in Python. I'd recommend avoiding LangChain as it tends to be overly complex and slow; once you've clarified your requirements, it's often more efficient to write the code directly. Nowadays most LLMs accept the OpenAI API. Regarding your question about not using HuggingFaceEmbeddings:

HuggingFaceEmbeddings has proven to be reliable and efficient for local use in my experience.
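
As a sketch of the offline workflow (the cache path is an example; gtr-t5-large is one of the models mentioned above), you can pin the download location with cache_folder so later runs load from disk:

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/gtr-t5-large",
    cache_folder="./models",  # downloaded on first run, loaded from disk afterwards
)
print(len(embeddings.embed_query("test sentence")))  # embedding dimension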

Upvotes: 2

Kamlesh

Reputation: 529

Check the code below:

import os

import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment

class LangchainService:

    def __init__(self, path='./database'):
        self.__model_name = "gpt-3.5-turbo"
        self.__path = path
        self.__persistent_client = chromadb.PersistentClient(path=self.__path, settings=Settings())
        self.__openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name="text-embedding-ada-002")


    def get_or_create_document_collection(self, collection_name="collection_name"):
        collection = self.__persistent_client.get_or_create_collection(collection_name, embedding_function=self.__openai_ef)
        return collection

langService = LangchainService()
document_collection = langService.get_or_create_document_collection(collection_name="document_collection")
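
Once you have the collection, you add and query documents through Chroma's native API (the ids and texts below are placeholders):

document_collection.add(
    documents=["LangChain supports custom embedding functions."],
    ids=["doc-1"],
)
results = document_collection.query(query_texts=["custom embeddings"], n_results=1)
print(results["documents"])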

Upvotes: -1

Jay Ghiya

Reputation: 504

from langchain.embeddings import HuggingFaceEmbeddings


modelPath = "BAAI/bge-large-en-v1.5"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device': 'cpu'}

# If using Apple M1/M2, use device 'mps' instead (this will use Apple Metal)

# Create a dictionary with encoding options, setting 'normalize_embeddings' to True
encode_kwargs = {'normalize_embeddings': True}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)

This is how you can use it locally. It will download the model once; after that you can go ahead and use it:

from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(texts, embedding=embeddings)  # texts: your list of Documents
retriever = db.as_retriever(
    search_type="mmr",  # Also test "similarity"
    search_kwargs={"k": 20},
)
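
You can then call the retriever directly (the query is a placeholder; invoke is the entry point on recent LangChain versions):

docs = retriever.invoke("your question here")
for d in docs:
    print(d.page_content[:100])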

Upvotes: 5

ALSED404

Reputation: 1

I believe I have some ideas that may help you with using a custom embedding model in LangChain with ChromaDB. If I understand correctly, you need to create a new class that re-implements some methods of the base embedding class. I have two suggestions on how to achieve this.

The first method is class inheritance: create a new class that inherits from the base embedding class and override specific methods according to your needs, as the other answers here demonstrate.

The second method is composition: wrap an existing embeddings object, override only the methods you want to change, and delegate the rest to the wrapped object, as in the sketch below.
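
Here is a minimal sketch of that second approach, assuming an existing embeddings object to wrap (the lowercase tweak is just an invented example of a behavior change):

from typing import List

class WrappedEmbeddings:
    """Wraps an existing embeddings object and overrides only embed_query."""

    def __init__(self, base):
        self.base = base

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Delegate unchanged behavior to the wrapped object.
        return self.base.embed_documents(texts)

    def embed_query(self, query: str) -> List[float]:
        # Invented example tweak: normalize the query text before embedding.
        return self.base.embed_query(query.lower())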

As for how to do this in more detail, it will depend on the programming language you are using and the library that Langchain relies on. It might be best to refer to the Langchain documentation and available examples for further assistance in customizing your embedding.

I hope this helps in general, and if you need more specific assistance for your particular case, please specify the tools and libraries you are using, and I will try to help you better.

Upvotes: -7
