How to insert data into a 4 node cassandra cluster using datastax cassandra-core-api?

Question

I had a single-node (DataStax) Cassandra cluster into which I had to insert about 10 GB of data from a file. I wrote a Java program to read the file and store the data as follows:

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.Date;
 import com.datastax.driver.core.BoundStatement;
 import com.datastax.driver.core.Cluster;
 import com.datastax.driver.core.PreparedStatement;
 import com.datastax.driver.core.Session;

 public class Xb {

//cluster and session for cassandra connection
private static Cluster cluster;
private static Session session;

//variables for storing file elements
private static String taxid;
private static String geneid;
private static String status;
private static String rna_version;
private static String rna_gi;

private static String protein_version;
private static String protein_gi;
private static String gen_nuc_ver;

private static String gen_nuc_gi;
private static String start_gen_acc;
private static String end_gen_acc;

private static String orientation;
private static String assembly;

     private static String mature_ver;

     private static String mature_gi;

     private static String symbol;

    //Connecting the cassandra node(local host)
    public static Cluster connect(String node){
    return Cluster.builder().addContactPoint(node).build();
   }
    public static void main(String[] args) {
    long lStartTime = new Date().getTime();
    // TODO Auto-generated method stub
    //call connect by passing localhost 
    cluster =connect("localhost");
    session = cluster.connect();
    //session.execute("CREATE KEYSPACE test1 WITH REPLICATION =" +"{'class':'SimpleStrategy','replication_factor':3}");
    //session.createtable('genomics');
    //use test1 : triggers the use of test1 keyspace
    session.execute("USE test1");
    //for counting the lines in the file
    int lineCount=0;

    try
    {
        //Reading the file
        FileReader fr = new FileReader("/home/syedammar/gene2refseq/gene2refseq");
        BufferedReader bf = new BufferedReader(fr);
        String line;
        //iterating over each line in file
        while((line= bf.readLine())!=null){
                lineCount++;
                //splitting the line based on tab spaces
                String[] a =line.split("\\s+");
                System.out.println("Line Count now is ->"+lineCount);
                //System.out.println("This is content"+line+" OVER HERE");
                /*for(int i =0;i<a.length;i++){
                    System.out.println("a["+i+"] ->"+a[i]);
              }*/
                //assigning the values to the corresponding variables
                taxid =a[0];
                geneid=a[1];
                status=a[2];
                rna_version=a[3];
                rna_gi=a[4];
                protein_version=a[5];
                protein_gi=a[6]; 
                gen_nuc_ver=a[7];
                gen_nuc_gi=a[8];
                start_gen_acc=a[9];
                end_gen_acc=a[10];
                orientation=a[11];
                assembly=a[12];
                mature_ver=a[13];
                mature_gi=a[14];
                symbol=a[15];

            //Writing the insert query
            PreparedStatement statement = session.prepare(
            "INSERT INTO test.genomics " +
            "(taxid, " +
            "geneid, " +
            "status, " +
            "rna_version, " +
            "rna_gi, " +
            "protein_version, " +
            "protein_gi, " +
            "gen_nuc_ver, " +
            "gen_nuc_gi, " +
            "start_gen_acc, " +
            "end_gen_acc, " +
            "orientation, " +
            "assembly, " +
            "mature_ver, " +
            "mature_gi," +
            "symbol" + 
            ") VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);"); 

            //create the bound statement and initialise it with your prepared statement
            BoundStatement boundStatement = new BoundStatement(statement); 

            session.execute( // this is where the query is executed
            boundStatement.bind( // here you are binding the 'boundStatement'
            taxid,geneid,status,rna_version,rna_gi,protein_version,protein_gi,gen_nuc_ver,gen_nuc_gi,start_gen_acc,end_gen_acc,orientation,assembly,mature_ver,mature_gi,symbol));
    }//end of while
} //end of try
    catch(IOException e){
        e.printStackTrace();
    }   
        long lEndTime = new Date().getTime(); 
        long difference = lEndTime - lStartTime;
        long seconds = difference / 1000; //converting milliseconds to total elapsed seconds
        System.out.println("Elapsed seconds: " + seconds);
        System.out.println("No of lines read are :"+ lineCount);
        System.out.println("Record's entered into cassandra successfully");

        session.close();
        cluster.close();

    }//end of main
}// end of class

This worked fine; the records were stored in Cassandra.

Now I have set up a 4-node Cassandra cluster, and I want to do the same task: read the same file and store its contents in the 4-node cluster.

My question is how to do that: which node do I need to feed this program to? How do I approach this?

I also want to know how to establish a connection with the 4-node cluster and what changes I will have to make in the above code. I assume there would be some change in this part:

 public static Cluster connect(String node){
    return Cluster.builder().addContactPoint(node).build();
}

What would the changes be, and to which node do I feed this program? I am not clear on how this would work. Also, will it take the same amount of time to insert the entire data set into the 4-node cluster as it took for the single node, or will it be faster?

Thanks

phact · Accepted Answer

For a good reference program showing how best to load data into Cassandra using the DataStax Java driver, take a look at Brian Hess's Cassandra-loader.

which node do I need to feed this program

All Cassandra nodes are equal, and any of them can take writes. The driver, however, takes care of this for you. Just give it a few of your nodes as contact points; when it establishes the connection, it discovers which other nodes exist. It also knows which nodes own which data and routes the writes accordingly.
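As a minimal sketch of what that change to the question's `connect` method could look like: the version below passes several nodes to the builder instead of one. The IP addresses are hypothetical placeholders, the class name `ClusterConnect` is made up for illustration, and the keyspace `test1` is taken from the question. It assumes a 2.x/3.x DataStax Java driver on the classpath and has not been run against the asker's cluster.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Session;

public class ClusterConnect {

    // Hypothetical addresses: replace with two or three of your own nodes.
    // Listing more than one means the program can still connect if one is down.
    private static final String[] CONTACT_POINTS = {"10.0.0.1", "10.0.0.2", "10.0.0.3"};

    // Same role as connect(String node) in the question, but accepts several nodes.
    public static Cluster connect(String... nodes) {
        return Cluster.builder().addContactPoints(nodes).build();
    }

    public static void main(String[] args) {
        Cluster cluster = connect(CONTACT_POINTS);
        Session session = cluster.connect("test1"); // keyspace name from the question
        try {
            // The driver discovers the rest of the cluster from the contact points.
            for (Host host : cluster.getMetadata().getAllHosts()) {
                System.out.println("Discovered node: " + host.getAddress());
            }
        } finally {
            session.close();
            cluster.close();
        }
    }
}
```

The program itself can run on any machine that can reach the cluster; it does not have to be started on a particular Cassandra node. The rest of the loading code from the question can stay as it is.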

will it take the same amount of time to insert the entire data in 4 node cluster as it took for single node or will it be faster.

Once you take the replication factor (RF) into account, your cluster will scale linearly as you add nodes, so you will be able to increase your throughput linearly. For example, if 3 nodes with RF 3 can take X writes, 6 nodes with RF 3 can take ~2X writes.
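To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model: every client write is stored on RF nodes, and every node can absorb the same number of replica writes per second. The class name, method name, and the per-node figure are hypothetical; real throughput also depends on hardware, consistency level, and how the client issues writes.

```java
class WriteScaling {

    // Simplified model (an assumption, not a benchmark): each client write
    // costs `replicationFactor` replica writes, spread evenly over `nodes`.
    static double clientWritesPerSecond(int nodes, int replicationFactor,
                                        double perNodeReplicaWrites) {
        if (nodes < 1 || replicationFactor < 1 || replicationFactor > nodes) {
            throw new IllegalArgumentException("need 1 <= replicationFactor <= nodes");
        }
        return nodes * perNodeReplicaWrites / replicationFactor;
    }

    public static void main(String[] args) {
        double perNode = 10_000.0; // hypothetical replica writes/s per node

        double threeNodes = clientWritesPerSecond(3, 3, perNode);
        double sixNodes = clientWritesPerSecond(6, 3, perNode);
        System.out.println("6 nodes vs 3 nodes at RF 3: " + (sixNodes / threeNodes) + "x");

        // The asker's case: one node (RF 1) versus four nodes at RF 3,
        // the replication factor in the commented-out CREATE KEYSPACE line.
        double single = clientWritesPerSecond(1, 1, perNode);
        double fourNodes = clientWritesPerSecond(4, 3, perNode);
        System.out.println("4 nodes at RF 3 vs 1 node: " + (fourNodes / single) + "x");
    }
}
```

Under this model, going from one node to four nodes at RF 3 gives roughly a 4/3 gain in write capacity rather than 4x, because each row is now written three times. A single-threaded loader that waits for each insert to finish may not see even that gain, since the client rather than the cluster can become the bottleneck.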
