Reputation: 1451
I would like to write each column of a DataFrame into its own file or folder, similar to bucketing, except across all the columns. Is it possible to do this without writing a loop? I suppose I could also stack the columns and write with a bucketBy; are these the only ways?
This question is related to another question about Spark column-wise processing here on Stack Overflow.
Upvotes: 0
Views: 483
Reputation: 527
Not sure if this is what you are looking for, but you can use multiprocessing (a ThreadPool) to perform the per-column writes in parallel:
from multiprocessing.pool import ThreadPool

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

data = [
    ("var1", "a", "c"),
    ("var2", "b", "d"),
    ("var3", "b", "a"),
    ("var4", "d", "c"),
]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("numerator", StringType(), True),
    StructField("denominator", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)

# Change how many threads to run in parallel
pool = ThreadPool(3)

def parallelwrite(x):
    # Write a single column out as CSV into its own folder
    try:
        df.select(x).write.mode("overwrite").format("csv").save(
            "/hdfsData/bdipoc/poc/inbound/{}/".format(x)
        )
    except Exception as e:
        print(e)

# Submit one write job per column
pool.map(parallelwrite, df.columns)
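Each thread here only submits a Spark write job, so the per-column writes are scheduled concurrently on the cluster rather than one after another.

The question also mentions stacking the columns. A minimal sketch of that alternative, reusing the df defined above and assuming all columns can be cast to string (the output path is hypothetical), is to melt the DataFrame with the SQL stack() function and do a single write with partitionBy, which gives one sub-folder per original column without a per-column write loop:

# Melt every column into (col_name, value) rows with stack();
# values are cast to string so the melted column has a single type.
stack_expr = "stack({}, {}) as (col_name, value)".format(
    len(df.columns),
    ", ".join("'{0}', cast(`{0}` as string)".format(c) for c in df.columns),
)

(df.selectExpr(stack_expr)
    .write.mode("overwrite")
    .partitionBy("col_name")  # one sub-folder per original column
    .format("csv")
    .save("/hdfsData/bdipoc/poc/inbound_stacked/"))  # hypothetical output path

bucketBy would require saveAsTable and hashes rows into a fixed number of buckets, so partitionBy on the melted column name is the closer match for "one folder per column".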
Upvotes: 1