Gogolex

Reputation: 141

Export complete table to csv from cassandra

I have a very large table in Cassandra (~500 million rows) and I want to export a few columns for every row to a file. I tried this using the COPY command:

COPY keyspace.table (id, value) TO 'filepath' WITH DELIMITER=',';

but it took ~12 hours to complete. Is there any way to do this faster?

If exporting only some columns is a problem, exporting all the data would be fine too. The important thing is that I get all entries in a form I can process afterwards.

The other question is: is it possible to do this export in PHP using just the DataStax PHP driver?

Upvotes: 2

Views: 1627

Answers (3)

Puspendu Banerjee

Reputation: 2651

There are several options that can give you a fast and reliable turnaround:

  1. Hive [my preferred one; run SQL-like queries]
  2. Shark/Beeline [run SQL-like queries]
  3. Spark [fast for data-related computation, but not the best option for your use case]
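For illustration, assuming the Cassandra table has already been mapped into Hive (e.g. via a Cassandra storage handler), the export could be a single HiveQL query; the table name and output path here are placeholders:

```sql
-- Write the selected columns out as comma-delimited files
-- under /tmp/export on the local filesystem.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT id, value FROM mapped_cassandra_table;
```

Because Hive runs this as a distributed job, the scan is spread across the cluster instead of funneling through a single cqlsh process.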

For PHP (Hive PHP client):

<?php
// set THRIFT_ROOT to php directory of the hive distribution
$GLOBALS['THRIFT_ROOT'] = '/lib/php/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';
// Set up the transport/protocol/client
$transport = new TSocket('localhost', 10000);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHiveClient($protocol);
$transport->open();

// run queries, metadata calls etc
$client->execute('SELECT * from src');
var_dump($client->fetchAll());
$transport->close();

Ref: https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-PHP

Upvotes: 0

Gillespie

Reputation: 2228

The short answer is yes, there are faster ways to do this.

The how is a longer answer. If you are going to be saving these rows to a file on a regular basis, you might want to use Apache Spark. Depending on how much memory is on your Cassandra nodes, you can bring a simple scan-and-write of a 500 million row table down to under an hour.

Upvotes: 1

Andriy Kuba

Reputation: 8263

COPY ... TO is not a good idea to use on a large amount of data.
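That said, if you do stay with COPY, newer cqlsh versions expose tuning options (page size, page timeout, concurrent requests) that can speed up a large export considerably. The exact option names and availability depend on your Cassandra/cqlsh version, so treat this as a sketch:

```
COPY keyspace.table (id, value) TO 'filepath'
WITH DELIMITER=',' AND PAGESIZE=5000 AND PAGETIMEOUT=20 AND MAXREQUESTS=6;
```

Larger pages and more concurrent requests reduce round trips, at the cost of more load on the coordinator nodes.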

"is it possible to process this export in PHP just with the DataStax PHP driver"

I did a CSV export from Cassandra with the help of the DataStax Java driver, and PHP should follow the same algorithm. According to the documentation you can easily run a request and print the output. Pay attention to pagination as well.

You can convert an array to CSV with the help of the fputcsv function.

So, the simplest example would be:

<?php
$cluster   = Cassandra::cluster()                 // connects to localhost by default
                 ->build();
$keyspace  = 'system';
$session   = $cluster->connect($keyspace);        // create session, optionally scoped to a keyspace
$statement = new Cassandra\SimpleStatement(       // also supports prepared and batch statements
    'SELECT keyspace_name, columnfamily_name FROM schema_columnfamilies'
);
$future    = $session->executeAsync($statement);  // fully asynchronous and easy parallel execution
$result    = $future->get();                      // wait for the result, with an optional timeout

// Here you can print CSV headers.

foreach ($result as $row) {                       // results and rows implement Iterator, Countable and ArrayAccess
    // Here you can print CSV values  
    // printf("The keyspace %s has a table called %s\n", $row['keyspace_name'], $row['columnfamily_name']);
}  
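Putting the two together, here is a minimal sketch of a paged export with fputcsv. The keyspace, table, and column names are placeholders, and the paging calls (the page_size execution option, isLastPage(), nextPage()) are assumed from the DataStax PHP driver's paging API:

```php
<?php
$cluster   = Cassandra::cluster()->build();       // connects to localhost by default
$session   = $cluster->connect('my_keyspace');
$statement = new Cassandra\SimpleStatement('SELECT id, value FROM my_table');
$options   = new Cassandra\ExecutionOptions(array('page_size' => 5000));

$out = fopen('export.csv', 'w');
fputcsv($out, array('id', 'value'));              // CSV header

$rows = $session->execute($statement, $options);  // first page
while (true) {
    foreach ($rows as $row) {
        fputcsv($out, array($row['id'], $row['value']));
    }
    if ($rows->isLastPage()) {
        break;
    }
    $rows = $rows->nextPage();                    // fetch the next page
}
fclose($out);
```

Paging keeps memory usage flat no matter how many rows the table has, which matters at the 500 million row scale in the question.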

Upvotes: 1
