Johan Witters
Johan Witters

Reputation: 1636

How can I construct a String with the contents of a given DataFrame in Scala

Consider I have a dataframe. How can I retrieve the contents of that dataframe and represent it as a string.

Consider I try to do that with the below example code.

val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
  println("x = ", x)
  sb.append(x)
})
println("sb = ", sb)

The output of the code shows the example dataframe has contents:

(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))

However, the final stringbuilder contains an empty string.

Any thoughts how to retrieve a String for a given dataframe in Scala?

Many thanks

Upvotes: 0

Views: 476

Answers (3)

Johan Witters
Johan Witters

Reputation: 1636

Thanks everybody for the feedback and for understanding this slightly better.

The combination of responses result in the below. The requirements have changed slightly in that I represent my df as a list of jsons. The code below does this, without the use of the broadcast.

class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
  val jsons = df.limit(limit).collect.map(rowToJson(_))

  def rowToJson(r: org.apache.spark.sql.Row) : JSONObject = {
    try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
    catch { case t: Throwable =>
        JSONObject.apply(Map("Row with error" -> t.toString))
    }
  }
}

The class I use here...

val jsons = new HandleDf(df, 100).jsons

Upvotes: 0

Raphael Roth
Raphael Roth

Reputation: 27383

Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?

Upvotes: 0

S. O.
S. O.

Reputation: 179

UPD: as mentioned by @user8371915, solution below will work only in single JVM in development (local) mode. In fact we cant modify broadcast variables like globals. You can use accumulators, but it will be quite inefficient. Also you can read an answer about read/write global vars here. Hope it will help you.

I think you should read topic about shared variables in Spark. Link here

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Let's have a look at broadcast variables. I edited your code:

val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
  println("x = ", x)
  broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)

Here I used broadcastVar as a container for a StringBuilder variable sb. Here is output:

(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))

Hope this helps.

Upvotes: 2

Related Questions