John R Ramsden

Reputation: 365

Why does a Cassandra table export with pycassa return fewer rows than a CQL SELECT?

I have been tasked with upgrading a Cassandra installation with millions of records from version 2.1 to 3.11 (the latest). To complicate matters, there are some internal formatting issues with mis-encoded UTF-8 values.

I fixed the UTF-8 issues by changing every validator to the bytes type, so at least now all the records are visible (i.e. they no longer trigger format errors). But I am having trouble exporting the data to CSV files:

  1. A CQL SELECT command, either in cqlsh or in a script via the DataStax driver (roughly the kind of script sketched after this list), would be the obvious solution. But SELECT statements have a default LIMIT of 10,000, and I have read that changing this to a much larger value will cause all kinds of errors; in fact the documentation advises that it must not be set to more than 2,000,000. So that is the CQL approach excluded.

  2. The dsbulk utility would be the next choice. But when I tried this on a test table, it produced the output byte strings in a weird encoding that I couldn't make sense of.
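
For completeness, this is roughly what I mean in point 1 by "a script via the DataStax driver" - only a sketch, reusing the host, credentials, keyspace and table names that appear elsewhere in this post; I have not pursued it because of the LIMIT concern above:

#!/usr/bin/env python
# Sketch only: dump the 'complete' table as hex via the DataStax python-driver.
import binascii

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth = PlainTextAuthProvider(username='cassandra', password='cassandra')
cluster = Cluster(['172.31.44.160'], auth_provider=auth)
session = cluster.connect('app2_uat')

for row in session.execute("SELECT key, column1, value FROM complete"):
    # blob columns come back as byte strings; print them in the hex form cqlsh uses
    print '0x%s | 0x%s | 0x%s' % (binascii.hexlify(bytes(row.key)),
                                  binascii.hexlify(bytes(row.column1)),
                                  binascii.hexlify(bytes(row.value or '')))

cluster.shutdown()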

So I had to fall back on Plan C, which is to use the pycassa driver to export the data. However, this throws up yet another problem: the number of records it reads is about half the number seen by CQL! My question is: why?

This is my Python 2 pycassa script (whose output is intended to be identical to that of the CQL SELECT, for comparison):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
#  Python 2 script to export raw CQL byte output of 'complete' or 'partial' column families from Cassandra 2.1.x tables
#
#  For small table exports, the CQL SELECT statement in table_export2.sh is fine. But SELECT has a default LIMIT of
#  10,000 and a zero LIMIT (meaning none) cannot be specified, and even if it could the process would hang and fall
#  over in a heap if millions of rows had to be exported.
#
import sys
import os
import re
import codecs
import collections
import binascii

import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

cass_node = sys.argv[1] if len(sys.argv) >= 2 else ''
namespace = sys.argv[2] if len(sys.argv) >= 3 else ''
col_fam   = sys.argv[3] if len(sys.argv) >= 4 else ''

if col_fam not in [ 'complete', 'partial' ]:
    print("Usage: python compart.py {cassandra_host_name} {cassandra_namespace_name}" +
                " { complete | partial } [ {CQL_file_name} ]")
    sys.exit(1)

file_path = sys.argv[4] if len(sys.argv) == 5 else ''

write_file = sys.stdout if file_path == '' else open(file_path, 'w')

# Export rows
#
#  Text file /etc/cassandra/conf/cassandra.yaml must include:
#
#    start_rpc: true
#    rpc_port: 9171
#
pool = ConnectionPool(namespace, [ cass_node + ':9171' ], \
                      credentials={'username': 'cassandra', 'password': 'cassandra'} )

table = pycassa.ColumnFamily(pool, col_fam)

print " key      | column1                                                                      | value"
print "----------+------------------------------------------------------------------------------+-------"

for fk, fv in table.get_range(start='', finish=''):

    key = binascii.hexlify(fk)

    if isinstance(fv, collections.OrderedDict):

        for column1, value in fv.items():

            if value != '':
                value = binascii.hexlify(value)

            print >> write_file, '%9s |%77s |%6s' % ('0x' + key, '0x' + binascii.hexlify(column1), '0x' + value)

if write_file is not sys.stdout:
    write_file.close()
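
For reference, I invoke the script like this (the host and keyspace are the ones from my cluster):

python compart.py 172.31.44.160 app2_uat complete complete_pycassa.txt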

This produces about 700 rows, but when I run "SELECT COUNT(*) FROM complete" in cqlsh there are around 1300 rows. I compared the output using WinMerge, and the missing records are a block with larger key values. So it appears the pycassa/Thrift driver misses the more recent records for some reason?!
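
To narrow this down I also intend to tally what get_range() actually hands back, counting partition keys and (key, column1) pairs separately, since as far as I understand COUNT(*) counts the latter on this table. A rough sketch, reusing the pool and table objects from the script above:

# Sketch: compare what pycassa's get_range() returns with SELECT COUNT(*)
partitions = 0   # distinct row keys returned by get_range()
pairs = 0        # (key, column1) combinations, i.e. what CQL counts as rows here
for fk, fv in table.get_range(start='', finish=''):
    partitions += 1
    pairs += len(fv)
print 'partitions = %d, (key, column1) pairs = %d' % (partitions, pairs)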

Any ideas?

-=-=-=-=-=-=-=-=-=-=-=

In response to Alex Ott's first comment:

Many thanks for your prompt reply Alex.

The table schema is:

CREATE TABLE app2_uat.complete (
    key blob,
    column1 blob,
    value blob,
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (column1 ASC)
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';

The dsbulk command I tried was:

dsbulk unload -u cassandra -p cassandra -h '["172.31.44.160"]'  -query "SELECT * from app2_uat.complete;"

My first issue with this approach is that it uses the bad old CQL SELECT with its LIMIT, which I can't arbitrarily increase or eliminate - SELECTs are out for that reason!

My second problem with it is that it outputs the blobs in a format I can't make sense of, e.g.:

key,column1,value
MTU4,Yy0xNTgyZmYxYzkwMGYxNzQ0ZmFhOWVlOGRkZWQxOTM2OGI3MA==,IA==
MTU4,Yy0xNThhMmYyNDliMWJhOWI0YWVmOTc4OTM5ZjE0NzZmNGFjNQ==,IA==
MTU4,Yy0xNThjODJmZGYyZDIyMjk0YWFjODBhNjQ5Y2NiZGZhMDk5Mg==,IA==
OTQ=,Yy05NDRhYjYyM2YxODMzNDQwM2M5MmNmOTc0ZWJkNjRiZmY0,IA==
MTE3,Yy0xMTc3OTEzNWE0OTNlYjU0YzU1YTNjMTdhNzc5YTk2ZTM1ZQ==,
MTE3,Yy0xMTdiYzQ2ZmNhNTc1ZmQ0MDk3YmQ0NTYxODdhMzQxYTQ1Ng==,
NTg=,Yy01ODZmZmVhNTczYjRjNzQwNGJiYjFjNzM2MzMxNTM5Mzhj,
NTg=,Yy01ODhjMmI3ZWJjNWYwNjQ1OGQ5NGMwNDljOWI1OGRiYjk0,

which I can't relate to the hex output which CQL (and my pycassa script) produces:

key      | column1                                                                      | value
----------+------------------------------------------------------------------------------+-------
     0x36 |       0x632d3632373566376561633136323131653338343730303230303839646531306266 |  0x20
     0x36 |       0x632d3633383563356632633136323131653339636363303230303839646531306266 |  0x20
     0x36 |       0x632d3634363737643663633136323131653362346135303230303839646531306266 |  0x20
     0x36 |       0x632d3635353933303361633136323131653361326137303230303839646531306266 |  0x20
     0x36 |       0x632d3636326264653836633136323131653362323363303230303839646531306266 |  0x20
     0x36 |       0x632d3637306662343934633136323131653361616439303230303839646531306266 |  0x20
     0x36 |       0x632d3637663931376230633136323131653361623963303230303839646531306266 |  0x20
     0x36 |       0x632d3638636164373061633136323131653339396134303230303839646531306266 |  0x20
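
Is the dsbulk output perhaps just Base64-encoded bytes? If that guess is right (I have not verified it), I suppose I could post-process its CSV into the hex form above with something along these lines:

#!/usr/bin/env python
# Untested sketch: convert a dsbulk-unloaded CSV (read from stdin) into hex,
# assuming the key/column1/value fields are Base64-encoded byte strings.
import base64
import binascii
import csv
import sys

reader = csv.reader(sys.stdin)
next(reader)    # skip the "key,column1,value" header line
for key, column1, value in reader:
    print ' | '.join('0x' + binascii.hexlify(base64.b64decode(f))
                     for f in (key, column1, value))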

Can you think of any reason why pycassa would not output a block of records that are visible in CQL?

Upvotes: 0

Views: 169

Answers (0)
