Reputation: 2410
I successfully embedded a 400-page PDF document in 1-2 hours. However, when I tried to embed a CSV file with about 40k rows and only one column, the estimated embedding time was approximately 24 hours.
Here is the code I used:
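# Imports assumed from the langchain / langchain_community packages; adjust to your installed version
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import CSVLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embedder = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)
file_path = 'filtered_combined_info.csv'
loader = CSVLoader(
    file_path=file_path,
    encoding='utf-8',  # or 'ISO-8859-1' if utf-8 doesn't work
    autodetect_encoding=False  # Set to True if you want to attempt autodetection
)
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(data)
persist_directory = 'db'
vectordb = Chroma.from_documents(documents=docs,
                                 embedding=embedder,
                                 persist_directory=persist_directory)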
Why is the embedding process for the CSV file taking significantly longer than for the PDF file? Are there any optimizations or changes I can make to reduce the embedding time for the CSV file?
Additionally, is there anything I am doing wrong that might be causing it to take so much time?
Upvotes: 0
Views: 148
Reputation: 2410
I removed everything related to Ollama that I had installed on my local machine and moved the installation to Docker.
First, start Docker and run the following:
docker run -d --rm -v ./ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Then pull the embedding model inside the container:
docker exec -it ollama ollama pull nomic-embed-text
Now create the embedder the same way as before:
embedder = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)
Check the difference:
I don't know exactly why running Ollama in Docker increases the speed, but my guess is that moving from Windows (my machine) to Linux inside Docker made the difference.
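To put a number on the difference between the two setups, one option is to time the embedder on a small sample and extrapolate to the full dataset. The following is only a rough sketch: `estimate_total_seconds` and `fake_embed` are hypothetical names introduced here, the extrapolation assumes throughput stays roughly constant, and the stand-in embedder exists only so the snippet runs without an Ollama server. With LangChain you would pass `embedder.embed_documents` instead.

```python
import time
from typing import Callable, List, Sequence


def estimate_total_seconds(
    embed_fn: Callable[[List[str]], List[List[float]]],
    sample_texts: Sequence[str],
    total_count: int,
) -> float:
    """Time embed_fn on a sample and extrapolate linearly to total_count texts."""
    if not sample_texts:
        raise ValueError("sample_texts must not be empty")
    start = time.perf_counter()
    embed_fn(list(sample_texts))
    elapsed = time.perf_counter() - start
    # Average seconds per text in the sample, scaled to the full dataset.
    return elapsed / len(sample_texts) * total_count


if __name__ == "__main__":
    # Hypothetical stand-in embedder; replace with embedder.embed_documents.
    def fake_embed(texts: List[str]) -> List[List[float]]:
        time.sleep(0.001 * len(texts))
        return [[0.0] for _ in texts]

    estimate = estimate_total_seconds(fake_embed, ["row"] * 20, 40_000)
    print(f"Estimated time for 40,000 rows: {estimate:.0f} s")
```

Running this once against the native installation and once against the Docker container, with the same sample rows, gives two estimates that can be compared directly.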
Upvotes: 0