Reputation: 999
I have a DataFrame with 10K columns and 70 million rows. I want to calculate the mean and correlation of the 10K columns. I wrote the code below, but it won't work due to the 64KB generated-code-size issue (https://issues.apache.org/jira/browse/SPARK-16845).
Data:
region dept week sal  val1 val2 val3 ... val10000
US     CS   1    1    2    1    1    ... 2
US     CS   2    1.5  2    3    1    ... 2
US     CS   3    1    2    2    2.1  ... 2
US     ELE  1    1.1  2    2    2.1  ... 2
US     ELE  2    2.1  2    2    2.1  ... 2
US     ELE  3    1    2    1    2    ... 2
UE     CS   1    2    2    1    2    ... 2
Code:
from pyspark.sql import functions as func

aggList = [func.mean(c) for c in df.columns if c not in ('region', 'dept', 'week')]  # exclude the key columns
df2 = df.groupBy('region', 'dept').agg(*aggList)
Code 2:
aggList = [func.corr('sal', c).alias(c) for c in df.columns if c not in ('region', 'dept', 'week', 'sal')]  # exclude the keys and sal itself
df2 = df.groupBy('region', 'dept', 'week').agg(*aggList)
This fails. Is there an alternative way to work around this bug? Has anyone tried a DataFrame with 10K columns? And are there any suggestions for improving performance?
Upvotes: 4
Views: 2366
Reputation: 1269
We also ran into the 64KB issue, but in a where clause, which is filed under another bug report. The workaround we used was simply to do the operations/transformations in several steps.
In your case, this means you don't do all the aggregations in one step. Instead, loop over the relevant columns in an outer operation:
1. select to create a temporary dataframe which contains just the columns you need for the operation.
2. groupBy and agg like you did, except not with the whole list of aggregations, but with just one (or two; you could combine the mean and the corr).
3. .withColumn to append the aggregated columns from the temporary dataframes to a result df.
This is of course slower than doing it in one operation, but due to the lazy evaluation of the Spark DAG, it should still evaluate the whole analysis in one run. A sketch of such a loop follows.
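A minimal sketch of this batched loop, assuming the grouping keys are region, dept and week, that sal is the reference column for corr, and an arbitrary batch size of 500; since the per-batch aggregates live in separate dataframes, a join on the grouping keys stands in for the withColumn step:
from functools import reduce
from pyspark.sql import functions as func

key_cols = ['region', 'dept', 'week']
val_cols = [c for c in df.columns if c not in key_cols + ['sal']]

batch_size = 500  # arbitrary guess; tune so each generated plan stays under the 64KB limit
parts = []
for i in range(0, len(val_cols), batch_size):
    batch = val_cols[i:i + batch_size]
    # temporary dataframe containing only the columns this step needs
    tmp = df.select(key_cols + ['sal'] + batch)
    # a small list of aggregations per step instead of all 10K at once
    aggs = [func.corr('sal', c).alias(c) for c in batch]
    parts.append(tmp.groupBy(key_cols).agg(*aggs))

# combine the per-batch results on the grouping keys
df2 = reduce(lambda a, b: a.join(b, key_cols), parts)
With 70 million rows it may also pay off to df.cache() (or persist) before the loop, so each batch does not rescan the source data.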
Upvotes: 1