Reputation:
I'm prototyping Apache Ignite for use in a new product. I need to store a very wide matrix in memory and access various random rows and columns. The matrix has 50,000 columns and potentially millions of rows. It is essentially read-only, so write speed is not a concern.
We never need to order the results. The client application will request specific rows by primary key, sometimes particular columns, and occasionally the entire row(s).
I loaded the data into an Apache Ignite SQL table that looks like this:
CREATE TABLE myMatrix (
    name CHAR(20) PRIMARY KEY,
    col1 INT,
    col2 INT,
    col3 INT,
    ...
    col50000 INT
)
I've also indexed the 'name' column:
CREATE INDEX name_index ON myMatrix(name)
However, when I try to select just one row, it takes over 10 seconds to return the result. We need much, much faster response times: the client will expect a set of rows (possibly hundreds or thousands) and columns (up to all of them) in less than a second.
import time

import pandas as pd  # the result will eventually be handled with pandas
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)

# Time a single-row lookup by primary key.
now = time.time()
result = client.sql("SELECT * FROM myMatrix WHERE name='F1S4_160106_058_G01'")
print('Got result in', time.time() - now)
Why is this so painfully slow? I was hoping that holding the entire table in memory would give faster results. Is it just the width of the table that's causing the issue?
The database is running on an r4.8xlarge instance, with 32 vCPUs and 244 GB of memory.
Upvotes: 0
Views: 1926
Reputation: 52565
Based on @alamar's analysis, if you really can't move away from having thousands of columns, I'd recommend using the key-value API: https://apacheignite.readme.io/docs/python-thin-client-key-value
That said, not having 50,000 columns in a cache/table in the first place would be a better solution.
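If it helps, here is a minimal sketch of what the key-value approach looks like; the cache name 'matrix' is made up, and it assumes a local node on 127.0.0.1:10800:

from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)

# A plain key-value cache; no SQL schema, so no 50,000-column parse cost.
cache = client.get_or_create_cache('matrix')

# Store one row: the 'name' key maps to the full list of 50,000 ints.
cache.put('F1S4_160106_058_G01', list(range(50000)))

# Fetch the row back by key.
row = cache.get('F1S4_160106_058_G01')
print(len(row))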
Upvotes: 0
Reputation: 19313
I have tried running an example very similar to yours, and it turns out that Ignite's SQL parser (based on the H2 SQL parser) has quadratic performance complexity in the number of response columns. This means it is completely unfeasible to have 50,000 columns in an Ignite table, or anything more than a few dozen columns, for that matter. I will try to file an issue against the H2 bug tracker.
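If you want to reproduce this yourself, here is a rough micro-benchmark sketch along those lines; the table names are made up, and it assumes a local node on 127.0.0.1:10800:

import time
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)

# Create tables of increasing width and time a trivial SELECT * on each.
for n_cols in (100, 1000, 5000):
    cols = ', '.join(f'col{i} INT' for i in range(n_cols))
    client.sql(f'CREATE TABLE bench_{n_cols} (name VARCHAR PRIMARY KEY, {cols})')
    start = time.time()
    list(client.sql(f'SELECT * FROM bench_{n_cols} LIMIT 1'))
    # If parsing is quadratic in column count, these timings should
    # grow much faster than linearly.
    print(n_cols, time.time() - start)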
Previous answer:
Do you have persistence enabled, or is everything in RAM? I just can't see why it would take so long. Maybe there's an index inlining problem; have you tried

CREATE INDEX name_index ON myMatrix(name) INLINE_SIZE 25

? I also second that having 50,000 columns is not optimal. Better to use an array.
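To make the array idea concrete, one possible sketch is to store each row as a single packed byte string via the key-value API; the cache name is made up, and 4-byte little-endian ints are an assumption:

import struct
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)
cache = client.get_or_create_cache('matrix_rows')

N_COLS = 50000
row = list(range(N_COLS))

# Pack the whole row into one byte string, 4 bytes per column.
cache.put('F1S4_160106_058_G01', struct.pack(f'<{N_COLS}i', *row))

# Read it back; an individual column can be sliced out by offset
# without unpacking the full row.
blob = cache.get('F1S4_160106_058_G01')
col_42 = struct.unpack_from('<i', bytes(blob), 42 * 4)[0]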
Upvotes: 2
Reputation: 70
This is not exactly an answer, but I saw your previous question, and now I see you're using pandas. What if you just pickle your matrix (a NumPy/pandas array, I suppose), put it in an Ignite cache as a ByteArrayObject, and then unpickle it back to a pandas/NumPy array before use? Would that solve your problem in general?
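A minimal sketch of that approach, assuming a local node and a made-up cache name:

import pickle

import numpy as np
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)
cache = client.get_or_create_cache('pickled_rows')

row = np.arange(50000, dtype=np.int32)

# pickle.dumps() returns bytes, which the client stores as a ByteArrayObject.
cache.put('F1S4_160106_058_G01', pickle.dumps(row))

# Unpickle back to a NumPy array before use.
restored = pickle.loads(bytes(cache.get('F1S4_160106_058_G01')))
print(restored.shape)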
A small caveat if it does: ByteArrayObject is very slow in the current release, but I have already made improvements, and the fixes are waiting to be included in upcoming versions. Write me if you need them packaged.
There is also the Apache Ignite Users mailing list. You can subscribe and discuss your task there.
Upvotes: 0