xinit

Reputation: 159

Handling repetitive data in Spark dataframes/datasets

I am new to Spark and Scala, and even after reading various documents I am still unable to find the best way to solve this issue.

I have a fairly large data set (~TB) that could be loaded into a dataframe as follows:

  1. 1 million rows

  2. Columns are time, data, Info1, Info2.

  3. Except for time, which is a float, all other columns are arrays of floats of size ~200K (see the schema sketch after this list).

  4. Info1 and Info2 are identical for all rows.

  5. It appears that shared variables (such as broadcast variables) cannot be accessed by dataframes/datasets.

  6. Rows can be case classes but they can't have static variables/companion objects in Spark.

  7. Rows cannot be regular classes.

  8. The only way out seems to be redundancy, with Info1 and Info2 repeated identically in every row, but that seems terribly expensive in cases such as these.

  9. Using crossJoin may incur too much communication cost.
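
For concreteness, here is a minimal sketch of the layout described above (the case class and field names are illustrative, not from the actual code):

// Every row repeats the same ~200K-element Info1 and Info2 arrays.
case class Record(
  time: Float,
  data: Array[Float],
  info1: Array[Float],  // identical across all rows
  info2: Array[Float]   // identical across all rows
)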

I would be grateful for any input on how to represent this data in Spark.

TIA.

Upvotes: 0

Views: 74

Answers (1)

Quiescent

Reputation: 1144

One of the simplest solutions is to add a new column holding the constant array:

import org.apache.spark.sql.functions.typedLit

val arr = Array(12.223F, 12.1F, 213.21F)
val df1 = df2.withColumn("info", typedLit(arr))  // typedLit handles array literals; plain lit may reject arrays on some Spark versions
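
To verify the result, one can inspect the new column (a quick check, assuming df2 is the existing DataFrame loaded from the data set):

df1.printSchema()           // shows the added "info" column as an array of floats
df1.select("info").show(1)  // every row carries the same constant array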

Upvotes: 1
