Lorenzo Cutrupi

Reputation: 720

RAG model not reading json files

I'm trying to implement a simple RAG pipeline that reads a list of input files and answers questions based on their content:

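# Imports were not shown in the original post; these assume llama_index < 0.10 and langchain.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import completion_to_prompt, messages_to_prompt
from langchain.embeddings import HuggingFaceEmbeddings

documents = SimpleDirectoryReader("/content/Data/").load_data()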
llm = LlamaCPP(
    model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": -1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
embed_model = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large"
)
service_context = ServiceContext.from_defaults(
    chunk_size=256,
    llm=llm,
    embed_model=embed_model
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What is the quantity of Nokia 3310 available?")

But I noticed that the model cannot answer questions about the JSON files in the Data folder, while it works well for PDFs. Why does this happen, and how can I fix it? I can see that documents contains the JSON content too, so I think the problem is not in the first line of code but probably in the line that builds the index. Thank you in advance; let me know if you need more information.

Upvotes: 0

Views: 1133

Answers (1)

Tarumi

Reputation: 86

It looks like you're using the llama_index library. A quick search for SimpleDirectoryReader, the method you are using, turns up the file extensions it supports:

.csv - comma-separated values
.docx - Microsoft Word
.epub - EPUB ebook format
.hwp - Hangul Word Processor
.ipynb - Jupyter Notebook
.jpeg, .jpg - JPEG image
.mbox - MBOX email archive
.md - Markdown
.mp3, .mp4 - audio and video
.pdf - Portable Document Format
.png - Portable Network Graphics
.ppt, .pptm, .pptx - Microsoft PowerPoint
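
To see which files in a folder fall outside this list, you can compare each file's suffix against it. This is only a minimal sketch using the standard library: the SUPPORTED set is copied from the list above (which may differ between llama_index versions), and the helper name unsupported_files is made up for illustration.

```python
from pathlib import Path

# Extensions copied from the list above; assumed, not queried from llama_index.
SUPPORTED = {
    ".csv", ".docx", ".epub", ".hwp", ".ipynb", ".jpeg", ".jpg", ".mbox",
    ".md", ".mp3", ".mp4", ".pdf", ".png", ".ppt", ".pptm", ".pptx",
}


def unsupported_files(folder):
    """Return the files under `folder` whose extension is not in SUPPORTED."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() not in SUPPORTED
    )

# Hypothetical usage with the folder from the question:
# for path in unsupported_files("/content/Data/"):
#     print(path)
```

Any .json file in the folder would show up in the result, since .json is not in the set.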

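Note that .json is not on this list. As far as I know, files with other extensions are read as plain text by default, which would explain why the JSON still shows up in documents. In the documentation you will also find a link to a dedicated JSON reader.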
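
To give an idea of why a dedicated reader helps: raw JSON is mostly brackets and quotes, and small chunks cut out of it can lose the link between a key and its value. The sketch below is not the llama_index JSON reader, just a hypothetical helper (flatten_json) that turns nested JSON into one "path: value" line per leaf, which is easier to embed and retrieve. The sample record is invented.

```python
import json


def flatten_json(value, prefix=""):
    """Yield one 'path: value' line for every leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_json(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from flatten_json(child, f"{prefix}[{i}]")
    else:
        yield f"{prefix}: {value}"


# Invented sample data, only for illustration.
raw = '{"items": [{"name": "Nokia 3310", "quantity": 12}]}'
lines = list(flatten_json(json.loads(raw)))
print("\n".join(lines))
```

Each resulting line keeps the full path to the value, so a chunk containing "items[0].quantity: 12" still makes sense on its own.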

You may want to look at what is inside your documents variable and make sure the text extracted from the JSON files is intelligible.
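
A quick way to do that check is to print the file name and the first characters of each document. The sketch below assumes each document exposes a .text string and a .metadata dict with a "file_name" key, which I believe is what SimpleDirectoryReader produces; the preview helper and the stand-in objects are made up so that the snippet runs without llama_index installed.

```python
from types import SimpleNamespace


def preview(docs, width=80):
    """Return one line per document: file name plus the start of its text."""
    rows = []
    for doc in docs:
        name = doc.metadata.get("file_name", "<unknown>")
        # Collapse whitespace so multi-line text fits on one line.
        snippet = " ".join(doc.text.split())[:width]
        rows.append(f"{name}: {snippet}")
    return rows


# Stand-in for the real `documents` list (invented content).
docs = [
    SimpleNamespace(
        text='{"name": "Nokia 3310",\n "quantity": 12}',
        metadata={"file_name": "stock.json"},
    )
]
for row in preview(docs):
    print(row)
```

With the real pipeline you would call preview(documents) instead of using the stand-in list.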

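FYI, in the provided code the first thing a query does is not an LLM call. The query engine first searches your vector database for the chunks most similar to "What is the quantity of Nokia 3310 available?", and only those chunks are then handed to the LLM. If the JSON chunks are not among them, the LLM never sees that data.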
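
The similarity search itself can be pictured as ranking chunks by cosine similarity between embedding vectors. This is a rough sketch of the idea, not how llama_index implements it; the cosine and top_k helpers and the 3-dimensional toy vectors are invented for illustration (real embeddings have hundreds of dimensions).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """chunks is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy data: the JSON chunk's vector points away from the query vector.
chunks = [
    ("pdf chunk about phones", [0.9, 0.1, 0.0]),
    ("json chunk with quantity", [0.1, 0.2, 0.9]),
    ("pdf chunk about stock", [0.7, 0.6, 0.1]),
]
result = top_k([1.0, 0.3, 0.0], chunks)
print(result)
```

In this toy setup the JSON chunk is ranked last and is cut off by k=2, which is the kind of situation where the model cannot answer from the JSON even though it was loaded.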

Upvotes: 1
