smeeb

Reputation: 29537

Converting Java Map to Spark DataFrame (Java API)

I'm trying to use Spark (Java API) to take an in-memory Map (that potentially contains other nested Maps as its values) and convert it into a DataFrame. I think I need something along these lines:

Map myMap = getSomehow();
RDD myRDD = sparkContext.makeRDD(myMap); // ???
DataFrame df = sparkContext.read(myRDD); // ???

But I'm having a tough time seeing the forest for the trees here... any ideas? Again, this might be a Map<String,String> or a Map<String,Map>, where there could be several nested layers of maps-inside-of-maps-inside-of-maps, etc.

Upvotes: 0

Views: 4930

Answers (1)

raxous

Reputation: 69

So I tried something. I'm not sure if this is the most efficient way to do it, but I don't see another one right now.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import scala.Tuple2;

    SparkConf sf = new SparkConf().setAppName("name").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(sf);
    SQLContext sqlCon = new SQLContext(sc);

    // Build the sample data. The inner map has to exist before it can be
    // put into the outer map.
    Map<String, String> putMap = new HashMap<>();
    putMap.put("1", "test");

    Map<String, Map<String, String>> map = new HashMap<>();
    map.put("test1", putMap);

    // Turn each (key, innerMap) entry into a Tuple2 so it can be parallelized.
    List<Tuple2<String, Map<String, String>>> list = new ArrayList<>();
    for (String key : map.keySet()) {
        list.add(new Tuple2<String, Map<String, String>>(key, map.get(key)));
    }

    JavaRDD<Tuple2<String, Map<String, String>>> rdd = sc.parallelize(list);

    System.out.println(rdd.first());

    // Schema: one string column for the key, one map<string,string> column
    // for the inner map.
    List<StructField> fields = new ArrayList<>();
    StructField field1 = DataTypes.createStructField("String", DataTypes.StringType, true);
    StructField field2 = DataTypes.createStructField("Map",
            DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);

    fields.add(field1);
    fields.add(field2);

    StructType struct = DataTypes.createStructType(fields);

    // Convert each tuple into a Row that matches the schema above.
    JavaRDD<Row> rowRDD = rdd.map(new Function<Tuple2<String, Map<String, String>>, Row>() {

        @Override
        public Row call(Tuple2<String, Map<String, String>> arg0) throws Exception {
            return RowFactory.create(arg0._1, arg0._2);
        }

    });

    DataFrame df = sqlCon.createDataFrame(rowRDD, struct);

    df.show();
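
With the single entry above, df.show() should print one row, with test1 in the String column and the inner map (something like Map(1 -> test)) in the Map column.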

In this scenario I assumed that the Map in the DataFrame is of type Map<String, String> (see the nested sketch further down for deeper maps). Hope this helps!

Edit: Obviously you can delete all the prints; I only added them for visualization purposes!
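
For the deeper nesting the question mentions, the same pattern should carry over: you just wrap another createMapType around the value type in the schema. Here's a minimal sketch, reusing the sc and sqlCon from above (deepMap and its contents are made-up sample data):

    // Extra imports on top of the ones above:
    // import java.util.Arrays;
    // import org.apache.spark.sql.types.MapType;

    // Made-up sample: one more level of nesting than before.
    Map<String, String> innermost = new HashMap<>();
    innermost.put("1", "test");

    Map<String, Map<String, String>> middle = new HashMap<>();
    middle.put("inner", innermost);

    Map<String, Map<String, Map<String, String>>> deepMap = new HashMap<>();
    deepMap.put("test1", middle);

    // The value column is now map<string, map<string, string>>.
    MapType innerType = DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType);
    MapType valueType = DataTypes.createMapType(DataTypes.StringType, innerType);

    StructType nestedStruct = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("String", DataTypes.StringType, true),
            DataTypes.createStructField("Map", valueType, true)));

    // One row per outer key, with the still-nested value map as the second field.
    List<Row> rows = new ArrayList<>();
    for (Map.Entry<String, Map<String, Map<String, String>>> e : deepMap.entrySet()) {
        rows.add(RowFactory.create(e.getKey(), e.getValue()));
    }

    DataFrame nestedDf = sqlCon.createDataFrame(sc.parallelize(rows), nestedStruct);
    nestedDf.show();

Each additional nesting level is just another createMapType wrapped around the previous value type; the row-building loop stays the same.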

Upvotes: 1
