Stefan S

Reputation: 267

Spark filters are never applied to DataFrame in Java

I am very new to Spark, and I have a query that brings data from two Oracle tables. The tables have to be joined on a field, which works fine with the code below. However, I also need to apply filters, as in an Oracle WHERE clause, for example to bring back employees whose age is between 25 and 50, and then group the results with GroupBy and sort them with OrderBy. The problem is that the only operations performed correctly are the retrieval of all data from both tables and the join between them. The filters are simply not applied, and I have no idea why. There are no compile errors and the data loads fine, but the where clauses seem to have no effect, even though there are employees with age between 25 and 50. Can you please help me out with this? Many thanks!

public static JavaRDD<Row> getResultsFromQuery(String connectionUrl) {

    JavaSparkContext sparkContext = new JavaSparkContext(new SparkConf()
            .setAppName("SparkJdbcDs").setMaster("local"));
    SQLContext sqlContext = new SQLContext(sparkContext);

    Map<String, String> options = new HashMap<>();
    options.put("driver", "oracle.jdbc.OracleDriver");
    options.put("url", connectionUrl);
    options.put("dbtable", "EMPLOYEE");

    DataFrameReader dataFrameReader = sqlContext.read().format("jdbc")
            .options(options);

    DataFrame dataFrameFirstTable = dataFrameReader.load();

    options.put("dbtable", "DEPARTMENT");

    dataFrameReader = sqlContext.read().format("jdbc").options(options);

    DataFrame dataFrameSecondTable = dataFrameReader.load();

    //JOIN. IT WORKS JUST FINE!!!

    DataFrame resultingDataFrame = dataFrameFirstTable.join(dataFrameSecondTable, 
            "DEPARTMENTID");


    //FILTERS. THEY DO NOT THROW ERROR, BUT ARE NOT APPLIED. RESULTS ARE ALWAYS THE SAME, WITHOUT FILTERS
    resultingDataFrame.where(resultingDataFrame.col("AGE").geq(25));
    resultingDataFrame.where(resultingDataFrame.col("AGE").leq(50));

    JavaRDD<Row> resultFromQuery = resultingDataFrame.toJavaRDD();

    //HERE I CONFIRM THAT THE NUMBER OF ROWS GOTTEN IS ALWAYS THE SAME, SO THE FILTERS DO NOT WORK.
    System.out.println("Number of rows "+resultFromQuery.count());

    return resultFromQuery;

}

Upvotes: 2

Views: 2957

Answers (2)

Prabhat Jain

Reputation: 331

In the Scala API, writing the comparison with a plain == inside filter, as in:

people.select("person_id", "first_name").filter(people("person_id") == 2).show

won't work; you'll get the following error:

Error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)

filter (and where) expect a Column or a condition string, not a Boolean: == evaluates to a plain Boolean, while === on a Column builds the comparison expression that filter needs.

These two queries select a single row from a Spark DataFrame, using the two different clauses, filter and where:

people.select("person_id", "first_name").filter(people("person_id") === 2).show

people.select("person_id", "first_name").where(people("person_id") === 2).show

Use either of the above queries to select a single row from the DataFrame.
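
Since the question uses the Java API: Java has no === operator, so the same comparison goes through Column.equalTo instead. A minimal sketch of the equivalent Java call, assuming a hypothetical people DataFrame with the same columns as above:

// Java equivalent of the Scala queries above (Spark 1.x API).
// "people" is a hypothetical DataFrame with person_id and first_name columns.
DataFrame singleRow = people
        .select("person_id", "first_name")
        .where(people.col("person_id").equalTo(2)); // equalTo replaces Scala's ===
singleRow.show();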

Upvotes: 1

Justin Pihony

Reputation: 67075

where returns a new DataFrame and does NOT alter the existing one, so you need to store the output:

DataFrame greaterThan25 = resultingDataFrame.where(resultingDataFrame.col("AGE").geq(25));
DataFrame lessThanGreaterThan = greaterThan25.where(resultingDataFrame.col("AGE").leq(50));
JavaRDD<Row> resultFromQuery = lessThanGreaterThan.toJavaRDD();

Or you can just chain it:

DataFrame resultingDataFrame = dataFrameFirstTable.join(dataFrameSecondTable, "DEPARTMENTID")
  .where(col("AGE").geq(25))   // col(...) is the static import org.apache.spark.sql.functions.col;
  .where(col("AGE").leq(50));  // the variable cannot reference itself in its own initializer
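
The two range conditions can also be combined into a single where with Column.and. And since groupBy and orderBy likewise return new DataFrames, the same store-or-chain rule applies to the grouping and sorting the question asks about. A sketch, not the original poster's code, assuming the static import org.apache.spark.sql.functions.col and a simple count aggregation:

import static org.apache.spark.sql.functions.col;

DataFrame filtered = dataFrameFirstTable
        .join(dataFrameSecondTable, "DEPARTMENTID")
        .where(col("AGE").geq(25).and(col("AGE").leq(50))); // both bounds in one filter

// groupBy and orderBy also return new DataFrames, so keep assigning the result.
DataFrame grouped = filtered
        .groupBy(col("DEPARTMENTID"))
        .count()                          // hypothetical aggregation; adds a "count" column
        .orderBy(col("count").desc());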

Upvotes: 5
