Guforu

Reputation: 4023

How is it possible to add a new column to an existing DataFrame in Spark SQL?

I use DataFrame API.

I have an existing DataFrame and a List object (an Array would also work). How can I add this List to the existing DataFrame as a new column? Should I use the Column class for this?

Upvotes: 5

Views: 7349

Answers (4)

Jacob

Reputation: 168

This thread is a little old, but I ran into a similar situation using Java. I think more than anything, there was a conceptual misunderstanding of how I should approach this problem.

To fix my issue, I created a simple POJO to back the Dataset with the new column (as opposed to trying to build on an existing one). Conceptually, I hadn't understood that it was best to generate the Dataset during the initial read, which is where the additional column needed to be added. I hope this helps someone in the future.

Consider the following:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Read the table and map each Row to a POJO that carries the extra, calculated column.
    JavaRDD<MyPojo> myRdd = dao.getSession().read()
            .jdbc("jdbcurl", "mytable", someObject.getProperties())
            .javaRDD()
            .map(new Function<Row, MyPojo>() {

                private static final long serialVersionUID = 1L;

                @Override
                public MyPojo call(Row row) throws Exception {
                    Integer curDos = calculateStuff(row);   // manipulate my data

                    MyPojo pojoInst = new MyPojo();
                    pojoInst.setBaseValue(row.getAs("BASE_VALUE_COLUMN"));
                    pojoInst.setKey(row.getAs("KEY_COLUMN"));
                    pojoInst.setCalculatedValue(curDos);

                    return pojoInst;
                }
            });

    // Build the Dataset from the POJO RDD; the calculated value becomes the new column.
    Dataset<Row> myRddRFF = dao.getSession().createDataFrame(myRdd, MyPojo.class);

    // continue load or other operation here...
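The MyPojo bean itself is not shown above; here is a minimal sketch of what such a class might look like (the field names follow the setters used in the snippet, but the field types are assumptions):

    import java.io.Serializable;

    // Hypothetical bean backing the Dataset; field types are assumed for illustration.
    public class MyPojo implements Serializable {
        private static final long serialVersionUID = 1L;

        private Integer baseValue;
        private String key;
        private Integer calculatedValue;

        public Integer getBaseValue() { return baseValue; }
        public void setBaseValue(Integer baseValue) { this.baseValue = baseValue; }

        public String getKey() { return key; }
        public void setKey(String key) { this.key = key; }

        public Integer getCalculatedValue() { return calculatedValue; }
        public void setCalculatedValue(Integer calculatedValue) { this.calculatedValue = calculatedValue; }
    }

createDataFrame(myRdd, MyPojo.class) derives the schema from the bean's getters and setters, so the calculated value shows up as its own column.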

Upvotes: 1

Sachin Thapa

Reputation: 3719

Here is an example where we had a date column and wanted to add another column containing the month.

Dataset<Row> newData = data.withColumn("month", month((unix_timestamp(col("date"), "MM/dd/yyyy")).cast("timestamp")));
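Note that col, month and unix_timestamp are the static helper functions from org.apache.spark.sql.functions, so the snippet assumes imports along these lines:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.month;
    import static org.apache.spark.sql.functions.unix_timestamp;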

Hoping this helps!

Cheers!

Upvotes: 2

TheMP

Reputation: 8427

You should probably convert your List to a single-column RDD and join it on criteria picked by you. A simple DataFrame conversion (assuming import sqlContext.implicits._ is in scope so that toDF is available):

 val df1 = sparkContext.makeRDD(yourList).toDF("newColumn")

If you need to create an additional column to perform the join on, you can add more columns by mapping your list:

val df1 = sparkContext.makeRDD(yourList).map(i => (i, fun(i))).toDF("newColumn", "joinOnThisColumn")

I am not familiar with the Java version, but you should try using JavaSparkContext.parallelize(yourList) and apply similar mapping operations, based on this doc.
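For the Java side, a rough, untested sketch of that idea follows; jsc, sqlContext, existingDf, yourList and fun are placeholders for your own JavaSparkContext, SQLContext, existing DataFrame, list and key function, and the column types are assumed to be strings:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Turn the plain Java list into an RDD of Rows holding (value, join key).
    JavaRDD<Row> rowRdd = jsc.parallelize(yourList)
            .map(i -> RowFactory.create(i, fun(i)));

    // Declare the two columns so Spark knows their names and types.
    StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("newColumn", DataTypes.StringType, false),
            DataTypes.createStructField("joinOnThisColumn", DataTypes.StringType, false)
    });

    DataFrame df1 = sqlContext.createDataFrame(rowRdd, schema);

    // Attach the new column to the existing DataFrame by joining on the shared key.
    DataFrame result = existingDf.join(df1, "joinOnThisColumn");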

Upvotes: 5

Guforu

Reputation: 4023

Sorry, it was my fault. I already found the function withColumn(String colName, Column col), which should solve my problem.
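For reference, a minimal sketch of that method in use (the column names df, value, doubledValue and source are made up for illustration):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lit;

    // Derive a new column from an existing one, and add a constant column.
    DataFrame withExtra = df
            .withColumn("doubledValue", col("value").multiply(2))
            .withColumn("source", lit("fromList"));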

Upvotes: 2
