Oak

Reputation: 69

Python multiprocessing batch insert in Cassandra, no performance improvement

I tried batch inserts in a single process and with multiprocessing, but both took the same amount of time; I didn't get any performance improvement. The Cassandra keyspace uses SimpleStrategy, and I think the cluster has only one node. Could either of these be the cause?

This is my multiprocessing code; could you help me find what is wrong?

import csv
import time
from multiprocessing import Lock, Pool, Value

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement
from tqdm import tqdm

lock = Lock()
ID = Value('i', 0)
# files: list of CSV paths, defined elsewhere

def copy(x):
    cluster = Cluster()
    session = cluster.connect('test')
    count = 0

    insertt = session.prepare(
        "INSERT INTO table2(id, age, gender, name) VALUES (?, ?, ?, ?)")
    batch = BatchStatement()

    for i in x:
        with open(files[i]) as csvfile:
            reader = csv.reader(csvfile, delimiter=',')
            for row in tqdm(reader):
                with lock:
                    ID.value += 1
                    row_id = ID.value  # read while still holding the lock
                batch.add(insertt, (row_id, int(row[3]), row[2], row[1]))
                count += 1
                if count == 60:  # flush every 60 rows
                    session.execute(batch)
                    batch = BatchStatement()
                    count = 0
    if count:  # flush any leftover rows
        session.execute(batch)

if __name__ == '__main__':
    start = time.time()
    with Pool() as p:
        p.map(copy, [range(0, 6), range(6, 12), range(12, 18), range(18, 24)])
    print(time.time() - start)

Upvotes: 0

Views: 384

Answers (1)

Chris Lohfink

Reputation: 16410

Batches are not there to improve performance; quite the opposite, really. Logged batches in particular (which is what you're using here) cost more than 2x a normal write. An unlogged batch may improve performance slightly, but only if all the data in the batch belongs to the same partition.

In this particular example your throughput will also be limited by how fast your CSV reader can pull from disk. Since it's blocking, that is probably one of the primary drags on throughput. You can also use execute_async so you don't block building the next batch (though, once again, you shouldn't use a batch here) on the completion of the previous one.
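For reference, a minimal sketch of that async pattern with the Python driver. The `load_rows` helper and the `window` parameter are my own names, not part of the driver; `session` and `insert_stmt` are assumed to come from `cluster.connect(...)` and `session.prepare(...)` as in the question. Capping the number of in-flight futures keeps you from overwhelming the cluster:

```python
def load_rows(session, insert_stmt, rows, window=500):
    """Fire writes with execute_async, keeping at most `window` in flight.

    Instead of blocking on every row (session.execute) or building large
    batches, each row is submitted immediately and we only wait once the
    in-flight window fills up.
    """
    futures = []
    for row in rows:
        futures.append(session.execute_async(insert_stmt, row))
        if len(futures) >= window:
            for f in futures:
                f.result()  # block here; surfaces any write error
            futures = []
    for f in futures:  # drain the final partial window
        f.result()
```

Waiting for the whole window at once is the simplest scheme; for steadier throughput you could instead wait only for the oldest future before submitting the next row.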

Upvotes: 1
