Reputation: 137
Hi, I am trying to run a long Spark job that often fails with a StackOverflowError. The job reads a Parquet file and creates an RDD in a forEach loop. After doing some research I thought that checkpointing each RDD would solve my memory issues. (I have tried different settings for executor memory, memory overhead, parallelism, and repartitioning, and found the configuration that works best for the job, but it still fails sometimes depending on the load on our cluster.)
Now to my real issue. I am trying to create checkpoints by first reading the Parquet file into an RDD, then caching it, calling the checkpoint function, and then calling first() as an action to make the checkpoint happen. However, no checkpoints are created in the path that I specified, and in the YARN UI it says that the stage is skipped. Can anyone help me understand the problem? :)
ctx.getSparkContext().setCheckpointDir("/tmp/checkpoints");

public static void writeHhidToCouchbase(DataFrameContext ctx, List<String> filePathsStrings) {
  filePathsStrings
      .forEach(filePath -> {
        // Read the Parquet file for this path into a pair RDD of (uid, hhid)
        JavaPairRDD<String, String> rdd =
            UidHhidPerDay.getParquetFromPath(ctx, filePath);
        rdd.cache();
        rdd.checkpoint();
        rdd.first(); // action intended to materialize the cache and the checkpoint
        rdd.foreachPartition(p -> {
          CrumbsClient client = getClient();
          p.forEachRemaining(uids -> {
            Crumbs crumbs = client.getAsync(uids._1)
                .timeout(10, TimeUnit.SECONDS)
                .toBlocking()
                .first();
            String hHid = uids._2;
            if (hHid != null) {
              crumbs.getOrCreateSingletonCrumb(HouseholdCrumb.class).setHouseholdId(hHid);
              client.putSync(crumbs);
            }
          });
          client.shutdown();
        });
      });
}
The checkpoint is created once, in the first iteration, but never again. KR
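For reference, one way to verify whether a checkpoint actually happened is to ask the RDD itself after the action has run. This is just a small check, not part of my job; isCheckpointed() and getCheckpointFile() are part of the standard Java RDD API, and the checkpoint file only exists once an action has materialized the RDD:

JavaPairRDD<String, String> rdd =
    UidHhidPerDay.getParquetFromPath(ctx, filePath);
rdd.cache();
rdd.checkpoint();
rdd.first(); // the action that should trigger both the cache and the checkpoint

// Both calls are defined on JavaRDDLike; getCheckpointFile() returns an
// Optional that is only present after the checkpoint has been written.
System.out.println("checkpointed: " + rdd.isCheckpointed());
System.out.println("checkpoint file: " + rdd.getCheckpointFile());

Note also that a stage shown as "skipped" in the Spark/YARN UI is usually not an error: it means Spark found the stage's output already available (for example from the cache) and did not need to recompute it.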
Upvotes: 3
Views: 1593
Reputation: 137
My mistake, the partitions are actually created. The "first" partition that I mentioned above is a directory with the partitions inside it; a directory name like 8f987639-d5c8-46b8-a1e0-37081f9f8e00 is what confused me. However, the lineage comment from @ImDarrenG gave me some more insight: I created a new, repartitioned RDD from the first one, which I cache and checkpoint. This made the application more stable, with no failures.
JavaPairRDD<String, String> rdd =
    UidHhidPerDay.getParquetFromPath(ctx, filePath);
rdd.cache();
rdd.checkpoint();
rdd.first();
JavaPairRDD<String, String> rddToCompute = rdd.repartition(72);
rddToCompute.foreachPartition...
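For completeness, this is a sketch of how the whole pattern looks with the foreachPartition body from the question; the repartition count of 72 is simply the value that worked for our cluster and load:

// Read, cache and checkpoint the source RDD so its lineage is truncated
JavaPairRDD<String, String> rdd =
    UidHhidPerDay.getParquetFromPath(ctx, filePath);
rdd.cache();
rdd.checkpoint();
rdd.first(); // action that materializes the cache and the checkpoint

// Derive a repartitioned RDD from the checkpointed one; the downstream
// work now starts from the checkpointed data instead of the full lineage.
JavaPairRDD<String, String> rddToCompute = rdd.repartition(72);
rddToCompute.foreachPartition(p -> {
  CrumbsClient client = getClient();
  p.forEachRemaining(uids -> {
    Crumbs crumbs = client.getAsync(uids._1)
        .timeout(10, TimeUnit.SECONDS)
        .toBlocking()
        .first();
    String hHid = uids._2;
    if (hHid != null) {
      crumbs.getOrCreateSingletonCrumb(HouseholdCrumb.class).setHouseholdId(hHid);
      client.putSync(crumbs);
    }
  });
  client.shutdown();
});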
Upvotes: 1