Joris Peeters

Reputation: 133

Fast time series bulk insert with cassandra and python

I am trying to find the fastest way to insert time series data in Cassandra, using the cassandra python client. I am running on localhost.

The preparatory code looks like:

import datetime

import numpy as np
import pandas as pd

import cassandra.query
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("CREATE KEYSPACE testt WITH replication = {'class' : 'SimpleStrategy', 'replication_factor': '1'}")
session.set_keyspace('testt')
session.execute('CREATE TABLE t (id text, time timestamp, t float, PRIMARY KEY (id, time))')

prepared = session.prepare("""
    INSERT INTO t (id, time, t)
    VALUES (?, ?, ?)
    """)

index = pd.date_range("2003-01-01", "2004-01-01", freq="1min")
t = np.random.rand(len(index))*2-1
df = pd.DataFrame({'time': index, 't': t}).set_index('time')

print(len(df))
# 525601

And then I have tried two strategies. The first is using BatchStatements:

before = datetime.datetime.now()
x = 0
delta = 65535
while x < len(df):
    batch = BatchStatement(cassandra.query.BatchType.UNLOGGED)
    for i, r in df.iloc[x : x + delta].iterrows():
        batch.add(prepared, ("ID1", i, r['t']))
    session.execute(batch)    
    x += delta  
print("Elapsed: {}".format(datetime.datetime.now() - before)) 
# Elapsed: 0:01:01.341848

Note that the 65535-row chunks are due to the upper limit on the number of statements in a single BatchStatement (the native protocol encodes the statement count as a 2-byte unsigned short).

... and the second one is using execute_async:

before = datetime.datetime.now()
for i, r in df.iterrows():
    session.execute_async(prepared.bind(("ID1", i, r['t'])))
print("Elapsed: {}".format(datetime.datetime.now() - before)) 
# Elapsed: 0:03:51.169409

I have also tried the same scenario with InfluxDB (using its DataFrame client), where the same insertion takes about 5 seconds. So my question is whether I fell victim to some kind of anti-pattern (I read a few of the blog posts, but seem to be doing things OK), or whether this type of insertion is simply ~12x slower with Cassandra.

There's room for scaling up to more server nodes - or python multiprocessing etc on the client side - but I wanted to understand the single-node performance first.

Any pointers would be much appreciated!

Upvotes: 1

Views: 1182

Answers (1)

Chris Lohfink

Reputation: 16410

Batching is (except for special use cases) always going to be slower than just inserting the rows individually. The main case where unlogged batches improve performance is when all statements in the batch belong to the same partition, which yours do not.

cqlsh's COPY FROM is probably as fast as you will see using the python driver. You can see the code at https://github.com/apache/cassandra/blob/trunk/pylib/cqlshlib/copyutil.py; it's been heavily optimized.
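For completeness, a sketch of the COPY FROM route (the file path is illustrative): dump the rows to CSV with an explicit id column, then load it with cqlsh.

```shell
# Python side first (illustrative): write id,time,t rows with no header
#   df.assign(id='ID1').reset_index()[['id', 'time', 't']] \
#     .to_csv('/tmp/t.csv', index=False, header=False)
# Then bulk-load; the column list must match the CSV column order:
cqlsh -e "COPY testt.t (id, time, t) FROM '/tmp/t.csv'"
```

COPY FROM spawns worker processes internally, which is a large part of why it beats a single-process driver loop.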

Upvotes: 1
