Gheeroppa

Reputation: 79

How to store Bag of Words or Embeddings in a Database

I would like to store vector features, such as Bag-of-Words or word-embedding vectors, for a large number of texts in a dataset stored in a SQL database. What are the data structures and best practices for saving and retrieving these features?

Upvotes: 4

Views: 6532

Answers (5)

0asa

Reputation: 224

Maybe with

Upvotes: 0

Fauzan Taufik

Reputation: 349

There are databases that specialize in vector data for machine learning. Here is a list:

  1. Milvus https://milvus.io/
  2. Weaviate https://weaviate.io/
  3. AquilaDB https://docs.aquila.network
  4. Pinecone https://www.pinecone.io/

Upvotes: 10

coolflower

Reputation: 77

Milvus is an open-source vector database built to power embedding similarity search and AI applications.

https://github.com/milvus-io/milvus

I am currently testing it.

Upvotes: -1

polm23

Reputation: 15593

Word vectors should generally be stored as BLOBs if possible. If not, they can be stored as JSON arrays. Since the only reasonable operation for word vectors is to look them up by the word key, the other details don't particularly matter.
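As a minimal sketch of the BLOB approach, the snippet below packs each vector into raw float32 bytes and looks it up by word. The table name `word_vectors`, the helper names, and the sample values are illustrative assumptions, not anything from a particular library; it also assumes the same machine (and so the same byte order) writes and reads the data.

```python
import sqlite3
from array import array


def pack_vector(values):
    """Serialize a sequence of floats to raw float32 bytes (native byte order)."""
    return array("f", values).tobytes()


def unpack_vector(blob):
    """Rebuild a list of floats from raw float32 bytes."""
    vec = array("f")
    vec.frombytes(blob)
    return vec.tolist()


# Hypothetical schema: one row per word, keyed by the word itself.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_vectors (word TEXT PRIMARY KEY, vector BLOB NOT NULL)"
)

# Toy embeddings; real ones would come from your model.
embeddings = {"cat": [0.5, -1.0, 2.0], "dog": [0.25, 0.75, -0.5]}
conn.executemany(
    "INSERT INTO word_vectors (word, vector) VALUES (?, ?)",
    [(word, pack_vector(vec)) for word, vec in embeddings.items()],
)


def lookup(word):
    """Return the vector for a word, or None if the word is not stored."""
    row = conn.execute(
        "SELECT vector FROM word_vectors WHERE word = ?", (word,)
    ).fetchone()
    return unpack_vector(row[0]) if row else None
```

Calling `lookup("cat")` returns the stored list of floats. If you already use NumPy, its `tobytes` and `frombuffer` functions could play the same role as the two helpers here.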

For bag of words you would typically need three columns. This is what it would look like in SQLite:

create table bow (
  doc_id int,
  word text,
  count int);

Here the document IDs come from somewhere else. If you need to, you can make (doc_id, word) the primary key.
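To show how that table might be filled and read back, here is a small sketch using Python's built-in sqlite3 module. The whitespace tokenizer and the helper names `store_document` and `load_document` are assumptions made for illustration; the composite key is the optional (doc_id, word) key mentioned above.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bow (
        doc_id INTEGER,
        word   TEXT,
        count  INTEGER,
        PRIMARY KEY (doc_id, word)
    )
    """
)


def store_document(doc_id, text):
    """Count words with a naive whitespace tokenizer and insert one row per word."""
    counts = Counter(text.lower().split())
    conn.executemany(
        "INSERT INTO bow (doc_id, word, count) VALUES (?, ?, ?)",
        [(doc_id, word, n) for word, n in counts.items()],
    )


def load_document(doc_id):
    """Return the bag of words for one document as a {word: count} dict."""
    rows = conn.execute(
        "SELECT word, count FROM bow WHERE doc_id = ?", (doc_id,)
    )
    return dict(rows)


store_document(1, "the cat sat on the mat")
store_document(2, "the dog barked")
```

Only words that actually occur get a row, so the table is effectively a sparse representation of the document-term matrix.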

However, storing features like this in a SQL DB is generally not helpful. When you access word counts or word vectors you typically don't need a subset of them, you need them all at once, so the relational features of SQL aren't helpful.

Upvotes: 2

LukasP

Reputation: 96

This would depend on a number of factors, such as the precise SQL DB you intend to use and how you store the embedding. For instance, PostgreSQL lets you store, query, and retrieve JSON values ( https://www.postgresqltutorial.com/postgresql-json/ ). Other options, such as SQLite, would let you store string representations of JSON or pickled objects. That would be OK for storage, but it would make querying the elements inside the vector impractical.
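For the SQLite case, a rough sketch of the JSON-as-string option could look like the following. The table name `doc_embeddings` and the helper functions are hypothetical; the vector is serialized with `json.dumps` on the way in and parsed with `json.loads` on the way out, so the database only ever sees an opaque string.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_embeddings (doc_id INTEGER PRIMARY KEY, embedding TEXT NOT NULL)"
)


def save_embedding(doc_id, vector):
    """Store the vector as a JSON array string."""
    conn.execute(
        "INSERT INTO doc_embeddings (doc_id, embedding) VALUES (?, ?)",
        (doc_id, json.dumps(vector)),
    )


def load_embedding(doc_id):
    """Parse the stored JSON string back into a list, or return None if missing."""
    row = conn.execute(
        "SELECT embedding FROM doc_embeddings WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


save_embedding(1, [0.1, 0.2, 0.3])
```

Compared with pickle, JSON text stays readable from other languages and avoids executing arbitrary code when loading untrusted data, at the cost of more storage per float.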

Upvotes: 1
