Reputation: 4023
I use DataFrame API.
I have an existing DataFrame and a List object (an Array would also work). How can I add this List to the existing DataFrame as a new column? Should I use the Column class for this?
Upvotes: 5
Views: 7349
Reputation: 168
This thread is a little old, but I ran into a similar situation using Java. More than anything, I had a conceptual misunderstanding of how to approach this problem.
To fix my issue, I created a simple POJO to back the new column of a Dataset (as opposed to trying to build on an existing one). Conceptually, I hadn't understood that it was best to generate the Dataset during the initial read, where the additional column needed to be added. I hope this helps someone in the future.
Consider the following:
JavaRDD<MyPojo> myRdd = dao.getSession().read()
        .jdbc("jdbcurl", "mytable", someObject.getProperties())
        .javaRDD()
        .map(new Function<Row, MyPojo>() {
            private static final long serialVersionUID = 1L;

            @Override
            public MyPojo call(Row row) throws Exception {
                Integer curDos = calculateStuff(row); // manipulate my data
                MyPojo pojoInst = new MyPojo();
                pojoInst.setBaseValue(row.getAs("BASE_VALUE_COLUMN"));
                pojoInst.setKey(row.getAs("KEY_COLUMN"));
                pojoInst.setCalculatedValue(curDos);
                return pojoInst;
            }
        });

Dataset<Row> myRddRFF = dao.getSession().createDataFrame(myRdd, MyPojo.class);
// continue load or other operations here...
Upvotes: 1
Reputation: 3719
Here is an example where we had a column date and wanted to add another column with the month.
Dataset<Row> newData = data.withColumn("month", month((unix_timestamp(col("date"), "MM/dd/yyyy")).cast("timestamp")));
Hoping this helps!
Cheers!
Upvotes: 2
Reputation: 8427
You should probably convert your List to a single-column DataFrame and apply a join on criteria picked by you. Simple DataFrame conversion:
val df1 = sparkContext.makeRDD(yourList).toDF("newColumn")
If you need to create an additional column to perform the join on, you can add more columns by mapping over your list:
val df1 = sparkContext.makeRDD(yourList).map(i => (i, fun(i))).toDF("newColumn", "joinOnThisColumn")
I am not familiar with the Java version, but you should try using JavaSparkContext.parallelize(yourList)
and apply similar mapping operations based on this doc.
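A minimal sketch of that Java route, assuming a Spark 2.x SparkSession and a list of integers; the fun(i) derivation is a placeholder (here just i * 10) standing in for whatever key you need to join on:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ListToDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ListToDataFrame").master("local[*]").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        List<Integer> yourList = Arrays.asList(1, 2, 3);

        // Parallelize the list, then map each element to a Row holding
        // the value plus a derived column to join on.
        JavaRDD<Row> rows = jsc.parallelize(yourList)
                .map(i -> RowFactory.create(i, i * 10 /* fun(i) placeholder */));

        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("newColumn", DataTypes.IntegerType, false),
                DataTypes.createStructField("joinOnThisColumn", DataTypes.IntegerType, false)));

        Dataset<Row> df1 = spark.createDataFrame(rows, schema);
        df1.show();

        spark.stop();
    }
}

From there you can join df1 against your original DataFrame on joinOnThisColumn, as in the Scala version above.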
Upvotes: 5
Reputation: 4023
Sorry, it was my fault. I have already found the function withColumn(String colName, Column col),
which should solve my problem.
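For anyone landing here later, a minimal sketch of withColumn usage (the column names and the df argument are illustrative; lit and col come from org.apache.spark.sql.functions):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class WithColumnExample {
    // Adds a constant column and a column derived from an existing one.
    // Assumes df has a numeric column named "value".
    public static Dataset<Row> addColumns(Dataset<Row> df) {
        return df
                .withColumn("source", lit("import"))              // constant for every row
                .withColumn("doubled", col("value").multiply(2)); // computed from an existing column
    }
}

Note that withColumn builds the new column from literals or from existing columns of the same DataFrame; attaching an independent List still requires creating a second DataFrame and joining, as in the answer above.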
Upvotes: 2