harsh solanki

Reputation: 67

How to store a multidimensional array in Cassandra and Hive

So, I am following this example:

https://keras.io/examples/nlp/pretrained_word_embeddings/

In this example, an embedding matrix is generated in the following section:

num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

How can this be pushed to Cassandra and Hive? I have tried the following query:

statement = "CREATE TABLE schema.upcoming_calendar3 ( embedding_matrix list<frozen<set>>, PRIMARY KEY ( embedding_matrix) );"

However, that gives me the following error:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid non-frozen collection type for PRIMARY KEY component embedding_matrix"

Similarly, I want to send it to Hive as well.

Any help on which data type to use in Cassandra and Hive would be great, along with a more efficient way of sending the matrix to the DB.

Currently, I am pushing data like this:

statement = "insert into schema.upcoming_calendar3(embedding_matrix) values (%s);" % (embedding_matrix)

Upvotes: 1

Views: 274

Answers (1)

leftjoin

Reputation: 38335

If you want to use the column as a primary key, declare the top-level collection as frozen, like this:

embedding_matrix frozen<list<set<text>>>
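As a possible alternative to putting the whole matrix in a single primary-key cell, here is a minimal sketch that stores one row per token index with the vector as a list<float>. The keyspace and table names (my_keyspace, embedding_rows) and the helper matrix_to_rows are hypothetical, and the small nested list stands in for embedding_matrix. The sketch only builds the statements and parameters; the actual call to the driver is shown as a comment and assumes the DataStax cassandra-driver with an open session.

```python
def matrix_to_rows(matrix):
    """Turn a 2-D matrix (NumPy array or nested lists) into (token_id, vector) pairs."""
    return [(i, [float(x) for x in row]) for i, row in enumerate(matrix)]


# Hypothetical keyspace and table names: one row per token, not one huge cell.
CREATE_CQL = (
    "CREATE TABLE IF NOT EXISTS my_keyspace.embedding_rows ("
    "token_id int PRIMARY KEY, vector list<float>)"
)
# Placeholders keep the values out of the query string itself.
INSERT_CQL = (
    "INSERT INTO my_keyspace.embedding_rows (token_id, vector) VALUES (%s, %s)"
)

rows = matrix_to_rows([[0.0, 0.0], [1.0, 2.5]])  # stand-in for embedding_matrix

for token_id, vector in rows:
    # Assumed driver usage: session.execute(INSERT_CQL, (token_id, vector))
    print(token_id, vector)
```

Passing the values as parameters, rather than formatting the array into the statement with %, avoids sending NumPy's printed representation of the matrix to the server.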

In Hive, the corresponding data type is array<array<type>>; see the manual.
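For the Hive side, a rough sketch of how the matrix could be turned into an array<array<double>> expression. It assumes a hypothetical table named embeddings and a hypothetical helper hive_array_literal, and it uses INSERT ... SELECT with Hive's array() constructor function rather than a VALUES clause. The fixed precision of 6 decimals is an arbitrary choice, and depending on the Hive version the numeric literals may need an explicit cast to double.

```python
def hive_array_literal(matrix, precision=6):
    """Render a 2-D matrix as nested Hive array(...) constructor calls."""
    rows = ", ".join(
        "array(" + ", ".join(format(float(x), f".{precision}f") for x in row) + ")"
        for row in matrix
    )
    return "array(" + rows + ")"


# Hypothetical table name; the column type follows array<array<type>> with type = double.
CREATE_HQL = (
    "CREATE TABLE IF NOT EXISTS embeddings (embedding_matrix array<array<double>>)"
)

literal = hive_array_literal([[0.0, 0.0], [1.0, 2.5]])  # stand-in for embedding_matrix
insert_hql = "INSERT INTO TABLE embeddings SELECT " + literal
print(insert_hql)
```

For a matrix with many thousands of rows, a statement built this way becomes very long, so writing the data to a file and loading it into the table is probably the more practical route.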

Upvotes: 1
