Harish

Reputation: 999

Find mean and corr of 10,000 columns in a PySpark DataFrame

I have a DataFrame with 10K columns and 70 million rows. I want to calculate the mean and correlation of the 10K columns. I tried the code below, but it won't work due to the 64KB generated-code-size issue (https://issues.apache.org/jira/browse/SPARK-16845).

Data:

region dept week sal  val1 val2 val3 ... val10000
 US     CS   1    1    2    1    1    ... 2
 US     CS   2    1.5  2    3    1    ... 2
 US     CS   3    1    2    2    2.1  ... 2
 US     ELE  1    1.1  2    2    2.1  ... 2
 US     ELE  2    2.1  2    2    2.1  ... 2
 US     ELE  3    1    2    1    2    ... 2
 UE     CS   1    2    2    1    2    ... 2

Code:

from pyspark.sql import functions as func

aggList = [func.mean(c) for c in df.columns if c not in ('region', 'dept', 'week')]  # exclude keys
df2 = df.groupBy('region', 'dept').agg(*aggList)

Code 2:

aggList = [func.corr('sal', c).alias(c) for c in df.columns if c not in ('region', 'dept', 'week', 'sal')]  # exclude keys and sal itself
df2 = df.groupBy('region', 'dept', 'week').agg(*aggList)

This fails. Is there an alternative way to work around this bug? Has anyone tried a DataFrame with 10K columns? Are there any suggestions for improving performance?

Upvotes: 4

Views: 2366

Answers (1)

Joachim Rosskopf

Reputation: 1269

We also ran into the 64KB issue, though in a where clause; that variant is filed under a separate bug report. Our workaround was simply to do the operations/transformations in several steps.

In your case, this means that you don't do all the aggregations in one step. Instead, loop over the relevant columns in an outer operation, as sketched below:

  • Use select to create a temporary DataFrame that contains only the columns you need for the operation.
  • Use groupBy and agg as you did, but not with the full list of aggregations; just one (or two: you can combine the mean and corr in a single agg).
  • Once you have references to all the temporary DataFrames, join their aggregated columns back onto a result DataFrame on the grouping keys (withColumn alone cannot pull columns across DataFrames).
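A minimal sketch of that loop, assuming the column layout from the question; chunkSize, keys, valCols, and the mean_/corr_ prefixes are illustrative names I chose, and both statistics are computed on the same grouping here for brevity (use two separate loops if you need the different groupings from your two snippets):

from pyspark.sql import functions as func

keys = ['region', 'dept', 'week']
valCols = [c for c in df.columns if c not in keys + ['sal']]
chunkSize = 100  # assumption: small enough that each generated code block stays below 64KB

# Start from the distinct grouping keys and join each chunk's results back on.
result = df.select(*keys).distinct()
for i in range(0, len(valCols), chunkSize):
    chunk = valCols[i:i + chunkSize]
    aggList = ([func.mean(c).alias('mean_' + c) for c in chunk] +
               [func.corr('sal', c).alias('corr_' + c) for c in chunk])
    tmp = df.groupBy(*keys).agg(*aggList)
    result = result.join(tmp, on=keys, how='left')

Each groupBy/agg then only generates code for chunkSize columns, so no single plan exceeds the 64KB method limit, and the joins on the grouping keys reassemble the wide result.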

Thanks to Spark's lazy DAG evaluation, the whole analysis should still be executed in a single run, although it will of course be slower than doing everything in one operation.

Upvotes: 1
