Reputation: 87
I'm working on a heatmap project for my university. We have to read some data (212 GB) from a text file (coordinates, height), load it into HBase, and then retrieve it from a web client built with Express.
I practiced with a 144 MB file, and this is working:
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
for (String s : data.collect()) {
    String[] tmp = s.split(",");
    put.addImmutable(FAMILY,
            Bytes.toBytes(tmp[2]),
            Bytes.toBytes(tmp[0] + "," + tmp[1]));
}
table.put(put);
But now that I use the 212 GB file, I get memory errors. I guess the collect method gathers all the data in memory, so 212 GB is too much.
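As I understand it, collect() materializes the whole RDD on the driver, so the driver needs enough heap for the entire file, roughly:

List<String> allLines = data.collect(); // the entire 212 GB of lines ends up in the driver's memory
for (String s : allLines) {
    // ... build the Put row by row, still on the driver
}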
So now I'm trying this:
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
data.foreach(line -> {
    String[] tmp = line.split(",");
    put.addImmutable(FAMILY,
            Bytes.toBytes(tmp[2]),
            Bytes.toBytes(tmp[0] + "," + tmp[1]));
});
table.put(put);
And I'm getting "org.apache.spark.SparkException: Task not serializable". I searched for it and tried some fixes, without success, based on what I read here: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
Actually I don't understand everything in that topic; I'm just a student. Maybe the answer to my problem is obvious, maybe not. Anyway, thanks in advance!
Upvotes: 2
Views: 262
Reputation: 35229
As a rule of thumb, serializing database connections (of any type) doesn't make sense. They are not designed to be serialized and deserialized, Spark or not.
Create a connection for each partition instead:
data.foreachPartition(partition -> {
    Connection co = ConnectionFactory.createConnection(getConf());
    ... // All required setup
    Table table = co.getTable(TableName.valueOf(TABLE_NAME));
    Put put = new Put(Bytes.toBytes("KEY"));
    while (partition.hasNext()) {
        String line = partition.next();
        String[] tmp = line.split(",");
        put.addImmutable(FAMILY,
                Bytes.toBytes(tmp[2]),
                Bytes.toBytes(tmp[0] + "," + tmp[1]));
    }
    table.put(put);
    ... // Clean connections
});
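If it helps, here is a slightly fuller sketch of the same idea with the setup and cleanup written out. I'm assuming the standard HBase client lifecycle (create the Connection on the executor, close the Table and Connection in finally blocks) and HBaseConfiguration.create() for the configuration, since getConf() from the driver class won't be usable inside the closure; I've kept the row key / qualifier scheme from your code, adapt it as needed:

data.foreachPartition(partition -> {
    // Created on the executor for this partition only, never serialized
    Connection co = ConnectionFactory.createConnection(HBaseConfiguration.create());
    try {
        Table table = co.getTable(TableName.valueOf(TABLE_NAME));
        try {
            Put put = new Put(Bytes.toBytes("KEY")); // same key scheme as in the question
            while (partition.hasNext()) {
                String[] tmp = partition.next().split(",");
                put.addImmutable(FAMILY,
                        Bytes.toBytes(tmp[2]),
                        Bytes.toBytes(tmp[0] + "," + tmp[1]));
            }
            table.put(put); // write the cells accumulated for this partition
        } finally {
            table.close();
        }
    } finally {
        co.close();
    }
});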
I also recommend reading Design Patterns for using foreachRDD from the official Spark Streaming programming guide.
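The gist of that section is to reuse a connection across partitions and batches rather than opening one per record (or even per partition, if partitions are small). A rough analogue in this setting could be a lazily initialized, per-executor holder; HBaseConnectionHolder below is a hypothetical helper, not part of any library:

// Hypothetical helper: one HBase Connection per executor JVM, created lazily and reused
public class HBaseConnectionHolder {
    private static Connection connection;

    public static synchronized Connection get() throws IOException {
        if (connection == null) {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        }
        return connection;
    }
}

// Usage inside the action: no connection object is captured by the closure
data.foreachPartition(partition -> {
    Table table = HBaseConnectionHolder.get().getTable(TableName.valueOf(TABLE_NAME));
    // ... build and put as above, then close the table (the connection stays open for reuse)
});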
Upvotes: 1