xinit

Reputation: 159

Handling repetitive data in Spark dataframes/datasets

I am new to Spark and Scala, and even after reading various documents I am still unable to find the best way to solve this issue.

I have a fairly large data set (~TB) that could be loaded into a dataframe as follows:

  1. 1 million rows

  2. Columns are time, data, Info1, Info2.

  3. Except for time, which is a float, all other columns are arrays of floats of size ~200K (see the schema sketch after this list).

  4. Info1 and Info2 are identical for all rows.

  5. It appears that shared variables (such as broadcast variables) cannot be accessed by dataframes/datasets.

  6. Rows can be case classes but they can't have static variables/companion objects in Spark.

  7. Rows cannot be regular classes.

  8. The only way out seems to be redundancy, with Info1 and Info2 repeated identically in every row, but that seems terribly expensive in cases such as these.

  9. Using crossJoin may incur too much communication cost.
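
For concreteness, here is a minimal sketch of the layout described above (the case class and field names are illustrative, not from the actual code):

// Every row repeats the same ~200K-element Info1 and Info2 arrays.
case class Record(
  time: Float,
  data: Array[Float],
  info1: Array[Float],  // identical across all rows
  info2: Array[Float]   // identical across all rows
)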

I would be grateful for any input on how to represent this data in Spark.

TIA.

Upvotes: 0

Views: 74

Answers (1)

Quiescent

Reputation: 1144

One of the simplest solutions is to add a new column holding the constant array:

import org.apache.spark.sql.functions.typedLit

val arr = Array(12.223F, 12.1F, 213.21F)
val df1 = df2.withColumn("info", typedLit(arr))  // typedLit handles array literals; plain lit may reject arrays on some Spark versions
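
To verify the result, one can inspect the new column (a quick check, assuming df2 is the existing DataFrame loaded from the data set):

df1.printSchema()           // shows the added "info" column as an array of floats
df1.select("info").show(1)  // every row carries the same constant array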

Upvotes: 1
