Reputation: 302
I am looking for a 'pythonic' way to automate the creation of a search index as a source for my RAG tool. It is the same question as here: https://learn.microsoft.com/en-us/answers/questions/2130086/how-to-create-an-index-and-indexer-for-rag-solutio?comment=question#newest-question-comment .
As there are many docs / sources, I wanted to ask which approach is better for this case: using LangChain chunking like here: https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/data-chunking/langchain-data-chunking-example.ipynb
OR the SplitSkill with 'pages' like here:
What is the difference? Does SplitSkill treat documents in a similar way to the LangChain RecursiveCharacterTextSplitter?
Upvotes: 0
Views: 189
Reputation: 1775
The sample code in langchain-data-chunking-example.ipynb uses the chunk_size and chunk_overlap parameters of langchain.text_splitter's RecursiveCharacterTextSplitter.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from lib.common import get_encoding_name

# from_tiktoken_encoder enables us to split on tokens rather than characters
recursive_text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name=get_encoding_name(),
    chunk_size=600,
    chunk_overlap=125
)
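To answer the "does SplitSkill behave like the LangChain splitter" part conceptually: a recursive splitter tries coarse separators first (paragraphs, then lines, then words) and only falls back to a finer separator when a piece is still longer than the chunk size. Here is a toy, character-based sketch of that idea; it is NOT LangChain's actual implementation (which also handles overlap and token counting), just an illustration:

```python
# Toy illustration of recursive splitting (not the LangChain implementation).
# Coarse separators are tried first; finer ones only when a piece is still
# longer than chunk_size. Real splitters also add chunk_overlap between chunks.
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: hard cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
        else:
            if current:
                chunks.append(current)
            current = ""
            if len(piece) > chunk_size:
                # Piece itself is too long: recurse with finer separators.
                chunks.extend(recursive_split(piece, chunk_size, rest))
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks

chunks = recursive_split(
    "para one.\n\npara two is a bit longer here.\n\npara three.", 20
)
```

SplitSkill's "pages" mode works toward the same goal (size-bounded chunks that try not to cut mid-sentence), but it is a server-side black box, so you control it only through its parameters.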
The sample code in Tutorial-rag.ipynb uses the text_split_mode, maximum_page_length, and page_overlap_length parameters for SplitSkill. SplitSkill is a built-in skill from Microsoft, executed server-side, that carries out document chunking.
split_skill = SplitSkill(
    description="Split skill to chunk documents",
    text_split_mode="pages",
    context="/document",
    maximum_page_length=2000,
    page_overlap_length=500,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/content"),
    ],
    outputs=[
        OutputFieldMappingEntry(name="textItems", target_name="pages")
    ],
)
Comparing the examples, you can see that both use a chunk size (chunk_size / maximum_page_length) and an overlap. SplitSkill additionally specifies the unit to split on: pages here, but it can also be sentences (https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.splitskill?view=azure-python).
Either solution will break a large document into smaller chunks for you. It is best to test the parameters of SplitSkill and LangChain to see which chunking settings give better search results in the later vector search step. For example, if the chunk size is too small, meaningful text blocks may not be grouped together.
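To make the shared size/overlap idea concrete, here is a minimal fixed-size chunking sketch (illustrative only, using characters instead of tokens): each chunk repeats the last `overlap` characters of the previous one, so context at a chunk boundary is not lost.

```python
# Minimal sketch of fixed-size chunking with overlap, the idea behind both
# chunk_size/chunk_overlap (LangChain) and
# maximum_page_length/page_overlap_length (SplitSkill).
def chunk_with_overlap(text, size, overlap):
    step = size - overlap  # each new chunk starts `step` characters later
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij", size=4, overlap=2)
# -> ["abcd", "cdef", "efgh", "ghij", "ij"]
```

A larger overlap reduces the chance of splitting a relevant passage across two chunks, at the cost of indexing more (duplicated) text.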
My suggestion is to start with whatever is easiest for your application, then tune the parameters to improve the results. Leave deeper customization for last, if the results still need improvement (i.e. if you are already using AI Search, go with SplitSkill; if you are not using AI Search, or need a high level of customization, explore LangChain if really required).
P.S. I tried to look up the source code of the server-side implementation of the SplitSkill API, but it does not seem to be on GitHub; the AI Search skill repo only contains the HTTP client that calls the server side.
A few more links for reference: https://github.com/MicrosoftDocs/azure-ai-docs/blob/main/articles/search/vector-search-how-to-chunk-documents.md
https://learn.microsoft.com/en-us/azure/search/cognitive-search-working-with-skillsets
Upvotes: 0