Reputation: 159
I am new to Spark and Scala, and even after reading various documents I am still unable to find the best way to solve my issue.
I have a fairly large data set (~TB) that could be loaded into a dataframe as follows: 1 million rows, with columns time, data, Info1 and Info2. Except for time, which is a float, all the others are arrays of floats of size ~200K. Info1 and Info2 are identical for all rows.
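For concreteness, a row would look roughly like the following case class (column names as described above; this is just a sketch):

case class Record(
  time: Float,
  data: Array[Float],   // ~200K floats
  info1: Array[Float],  // same array in every row
  info2: Array[Float]   // same array in every row
)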
It appears that shared variables (such as broadcast variables) cannot be accessed by DataFrames/Datasets.
Rows can be case classes but they can't have static variables/companion objects in Spark.
Rows cannot be regular classes.
The only way out seems to be redundancy, with info1 and info2 repeated in every row, but that seems terribly expensive in cases such as this.
Using crossJoin may have too much communication cost.
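What I had in mind for the crossJoin is roughly the following (df and infoDf are placeholder names for the main dataframe and a single-row dataframe holding the two shared arrays):

import spark.implicits._  // spark is the active SparkSession

val infoDf = Seq((info1, info2)).toDF("Info1", "Info2")  // one row holding the shared arrays
val joined = df.crossJoin(infoDf)                        // attaches that row to every row of df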
I would be grateful for any input on how to represent this data in Spark.
TIA.
Upvotes: 0
Views: 74
Reputation: 1144
One of the simplest solutions is to add a new column holding the constant array:
import org.apache.spark.sql.functions.lit

val arr = Array(12.223F, 12.1F, 213.21F)
val df1 = df2.withColumn("info", lit(arr))  // every row of df1 gets the same array in the new "info" column
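The added column can then be checked with, for example:

df1.printSchema()                             // "info" appears as array<float>
df1.select("info").show(1, truncate = false)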
Upvotes: 1