Joyce

Reputation: 1451

Bucket all the columns in a Spark DataFrame

I would like to write each column of a DataFrame into its own file or folder, similar to bucketing, except across all of the columns. Is it possible to do this without writing a loop? I suppose I could also stack the columns and write with bucketBy (see the sketch below); are these the only ways?
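
For context, here is a minimal sketch of the "stack the columns" idea I had in mind (illustrative only: it melts the columns with stack() and uses a partitioned write, which creates one folder per column, instead of bucketBy, since bucketBy only works with saveAsTable; the output path is made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Toy data; column names are illustrative.
df = spark.createDataFrame(
    [("var1", "a", "c"), ("var2", "b", "d")],
    ["name", "numerator", "denominator"],
)

# Melt every column into (col_name, value) rows with stack().
pairs = ", ".join("'{0}', {0}".format(c) for c in df.columns)
stacked = df.select(
    expr("stack({}, {}) as (col_name, value)".format(len(df.columns), pairs))
)

# A single partitioned write then creates one folder per original column.
stacked.write.mode("overwrite").partitionBy("col_name").csv("/tmp/column_buckets")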

This question is related to another question about column-wise processing in Spark here on Stack Overflow.

Upvotes: 0

Views: 483

Answers (1)

Rafa

Reputation: 527

Not sure if this is what you are looking for; you can use multiprocessing to perform the operation in parallel:

from multiprocessing.pool import ThreadPool

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

data = [
    ("var1", "a", "c"),
    ("var2", "b", "d"),
    ("var3", "b", "a"),
    ("var4", "d", "c"),
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("numerator", StringType(), True),
    StructField("denominator", StringType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)

def parallelwrite(x):
    # Write a single column to its own folder, named after the column.
    try:
        df.select(x).write.mode("overwrite").format("csv").save(
            "/hdfsData/bdipoc/poc/inbound/{}/".format(x)
        )
    except Exception as e:
        print(e)

# Change how many threads to run
pool = ThreadPool(3)
pool.map(parallelwrite, df.columns)
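
A thread pool (rather than a multiprocessing process pool) is what makes this work: the threads share the driver's existing SparkSession, and each call to parallelwrite simply submits an independent Spark job that the scheduler can run concurrently, whereas a separate child process could not reuse the SparkSession.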

Upvotes: 1
