Training an LLM for Query Generation in a Graph Database

I have developed a graph database that has its own query language. I need a way to feed the graph to an LLM so that the LLM can generate queries in our database's language.

I have found something similar in LangChain: we can feed it an RDF file and it will generate SPARQL queries.

I have several questions about this, as I am very new to it:

Is it possible to train an LLM on an entirely new technology, such as our database? If so, how?

I know that we have to provide training data to the LLM. In this case, would that be a dataset of our database queries? If yes, how many queries do we need to provide in the dataset?

Sorry if the question is not detailed; it's only my second time asking here.

Upvotes: 4

Answers (1)

Thomas

Reputation: 1010

Background

It sounds like you're trying to integrate the LLM in an nl2SomeLanguage fashion. Examples would be nl2cypher, nl2sparql, nl2sql, etc., but there isn't a project out there that currently has this functionality for your language.

In this architecture, you're not "feeding the graph to the LLM". Your prompt contains the schema and the question ("Write a SPARQL query to return all xyz's, given this graph schema"). This relies on the LLM you're using already understanding the query language in question, which it currently doesn't for your language. This means some extra work for you.
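As a rough illustration of the "schema plus question" prompt described above, here is a minimal sketch. The function name `build_prompt`, the language name `MyGraphQL`, and the schema text are all hypothetical placeholders, not part of LangChain or any real library.

```python
def build_prompt(schema: str, question: str, language: str = "MyGraphQL") -> str:
    """Assemble a text-to-query prompt from a schema description and a question."""
    return (
        f"You write {language} queries.\n"
        f"Graph schema:\n{schema.strip()}\n\n"
        f"Question: {question.strip()}\n"
        f"Return only the {language} query."
    )


# Hypothetical schema text; in practice it would be extracted from the database.
schema = """
Node types: Person(name, age), Company(name)
Edge types: (Person)-[WORKS_AT]->(Company)
"""

prompt = build_prompt(schema, "Which people work at Acme?")
print(prompt)
```

The resulting string is what would be sent to the model. Note that the graph data itself never appears in the prompt, only its schema.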

It sounds like you're already familiar with the RDF graph in LangChain. Check out the source to get more familiar with how LangChain does it. Then look at all the other graphs it supports and you'll see the same pattern of extracting the schema, etc.

Thoughts

You generally can't "train" out-of-the-box LLMs like the GPT family. That would involve adjusting the model's weights, which isn't practical for a hosted model you don't control. Fine-tuning is also a little difficult, because in practice it mostly adjusts how the model responds rather than teaching it an entirely new language; it is not the same as training the model from scratch on your data. RAG also doesn't work for you here, because you're not doing document retrieval!

For a model that understands something existing models don't, you probably need to "train a model from scratch" (this is the keyword you should be putting into Google), which is both time-consuming and very expensive. This repository has a good overview of how to train the Llama model from scratch.

So in this case, will it be the dataset with our database queries? If yes, then how many queries do we have to provide in a dataset?

This is for you to determine ;) It will depend on a number of factors such as the complexity of your schema, complexity of the questions being asked, variability of how the questions are being asked, complexity of the query language, and what level of accuracy your product requires.
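One way to get a feel for "how many is enough" is to collect question/query pairs, hold some out, and measure accuracy as the dataset grows. The sketch below assumes a made-up query syntax and hypothetical helper names (`split_pairs`, `exact_match_accuracy`). Exact string match is only one possible metric and is stricter than checking whether two queries return the same results.

```python
import json
import random

# Hypothetical (question, query) pairs; the query syntax is invented for illustration.
pairs = [
    {"question": "Who works at Acme?", "query": "FIND Person WHERE WORKS_AT.name = 'Acme'"},
    {"question": "List all companies.", "query": "FIND Company"},
    {"question": "How many people are there?", "query": "COUNT Person"},
    {"question": "Who is older than 30?", "query": "FIND Person WHERE age > 30"},
]


def split_pairs(items, eval_fraction=0.25, seed=0):
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def exact_match_accuracy(predicted, expected):
    """Fraction of predictions equal to the reference query after trimming whitespace."""
    if not expected:
        return 0.0
    hits = sum(p.strip() == e.strip() for p, e in zip(predicted, expected))
    return hits / len(expected)


train, held_out = split_pairs(pairs)

# One JSON object per line (JSONL) is a common format for training data.
train_jsonl = "\n".join(json.dumps(p) for p in train)
```

If accuracy on the held-out pairs stops improving as you add more training pairs, that is a hint the dataset size is no longer the bottleneck.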

Conclusion

You can try fine-tuning existing models, but my expectations and hopes are pretty low that you'll get the results you need.
You'll most likely need to "train a model from scratch", which is a technical, time-consuming, and expensive endeavor.
Don't get caught up in RAG or document search for this.