Joris Peeters

Reputation: 133

Fast time series bulk insert with cassandra and python

I am trying to find the fastest way to insert time series data in Cassandra, using the cassandra python client. I am running on localhost.

The preparatory code looks like:

import datetime

import numpy as np
import pandas as pd

import cassandra.query
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("CREATE KEYSPACE testt WITH replication = {'class' : 'SimpleStrategy', 'replication_factor': '1'}")
session.set_keyspace('testt')
session.execute('CREATE TABLE t (id text, time timestamp, t float, PRIMARY KEY (id, time))')

prepared = session.prepare("""
    INSERT INTO t (id, time, t)
    VALUES (?, ?, ?)
    """)

index = pd.date_range("2003-01-01", "2004-01-01", freq="1min")
t = np.random.rand(len(index))*2-1
df = pd.DataFrame({'time': index, 't': t}).set_index('time')

print(len(df))
# 525601

And then I have tried two strategies. The first is using BatchStatements:

before = datetime.datetime.now()
x = 0
delta = 65535
while x < len(df):
    batch = BatchStatement(cassandra.query.BatchType.UNLOGGED)
    for i, r in df.iloc[x : x + delta].iterrows():
        batch.add(prepared, ("ID1", i, r['t']))
    session.execute(batch)    
    x += delta  
print("Elapsed: {}".format(datetime.datetime.now() - before)) 
# Elapsed: 0:01:01.341848

Note that the 65535-row chunks are due to the upper limit on the number of statements in a single BatchStatement (the native protocol encodes the statement count as a 2-byte unsigned short).

... and the second one is using execute_async:

before = datetime.datetime.now()
for i, r in df.iterrows():
    session.execute_async(prepared.bind(("ID1", i, r['t'])))
print("Elapsed: {}".format(datetime.datetime.now() - before)) 
# Elapsed: 0:03:51.169409

I have also tried the same scenario with InfluxDB (using its DataFrame client), where the same insertion takes about 5 seconds. So my question is whether I fell victim to some kind of anti-pattern (I read a few of the blog posts, but seem to be doing things OK), or whether this type of insertion is simply ~12x slower with Cassandra.

There's room for scaling up to more server nodes - or python multiprocessing etc on the client side - but I wanted to understand the single-node performance first.

Any pointers would be much appreciated!

Upvotes: 1

Views: 1182

Answers (1)

Chris Lohfink

Reputation: 16410

Batching is (except for special use cases) always going to be slower than just inserting the rows individually. The main case where unlogged batches improve performance is when all statements in the batch belong to the same partition, which yours do not.

cqlsh's COPY FROM is probably as fast as you will see using the python driver. You can see the code at https://github.com/apache/cassandra/blob/trunk/pylib/cqlshlib/copyutil.py; it's been heavily optimized.
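For completeness, a sketch of the COPY FROM route (the file path is illustrative): dump the rows to CSV with an explicit id column, then load it with cqlsh.

```shell
# Python side first (illustrative): write id,time,t rows with no header
#   df.assign(id='ID1').reset_index()[['id', 'time', 't']] \
#     .to_csv('/tmp/t.csv', index=False, header=False)
# Then bulk-load; the column list must match the CSV column order:
cqlsh -e "COPY testt.t (id, time, t) FROM '/tmp/t.csv'"
```

COPY FROM spawns worker processes internally, which is a large part of why it beats a single-process driver loop.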

Upvotes: 1
