user8414391

Reputation: 152

How to perform a multi-row multi-column operation in parallel within PySpark, with minimum loops?

I would like to perform a multi-row, multi-column operation in PySpark with few or no loops. The Spark DataFrame 'df' has the following data:

city    time    temp    humid
NewYork 1500    67      57
NewYork 1600    69      55
NewYork 1700    70      56
Dallas  1500    47      37
Dallas  1600    49      35
Dallas  1700    50      39    

I used 'for' loops, but at the cost of parallelism, and it's not effective:

from pyspark.sql.functions import col, lit

city_list = [i.city for i in df.select('city').distinct().collect()]
metric_cols = ['temp', 'humid']
for city in city_list:
    for metric in metric_cols:
        tempDF = df.filter(col("city") == city)
        metric_values = [i[metric] for i in tempDF.select(metric).collect()]
        time_values = [i['time'] for i in tempDF.select('time').collect()]
        tuples = list(zip(time_values, metric_values))
        newColName = city + metric
        df = df.withColumn(newColName, lit(tuples))

I don't think it's working either.

I expect the output to be:

city    time  temp  humid timetemp                         timehumidity
NewYork 1500  67    57    [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
NewYork 1600  69    55    [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
NewYork 1700  70    56    [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
Dallas  1500  47    37    [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
Dallas  1600  49    35    [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
Dallas  1700  50    39    [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]

or, at the least:

city     timetemp                         timehumidity
NewYork  [(1500,67),(1600,69),(1700,70)]  [(1500,57),(1600,55),(1700,56)]
Dallas   [(1500,47),(1600,49),(1700,50)]  [(1500,37),(1600,35),(1700,39)]

Upvotes: 0

Views: 459

Answers (2)

user8414391

Reputation: 152

Found a solution with higher performance in PySpark:

from pyspark.sql.functions import collect_list, arrays_zip
from pyspark.sql.window import Window

def create_tuples(df):
    mycols = ["temp", "humid"]
    lcols = mycols.copy()
    lcols.append("time")
    # collect each column into a per-city list using a window
    for lcol in lcols:
        df = df.select("*", collect_list(lcol).over(Window.partitionBy("city")).alias(lcol + '_list'))
    # zip the time list with each metric list into (time, metric) pairs
    for mycol in mycols:
        df = df.withColumn(mycol + '_tuple', arrays_zip("time_list", mycol + '_list'))
    return df

tuples_df = create_tuples(df)
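
As a usage sketch (assuming the df from the question): the intermediate *_list helper columns created by the window step can be dropped afterwards, leaving the original columns plus the zipped columns. Note that arrays_zip produces arrays of structs, not Python tuples.

# drop the helper columns, keep only the zipped (time, metric) columns
result_df = tuples_df.drop("time_list", "temp_list", "humid_list")
result_df.show(truncate=False)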

Upvotes: 1

Ala Tarighati

Reputation: 3817

One option is to use the struct function:

import pyspark.sql.functions as F

df.groupby('city').agg(
    F.collect_list(F.struct(F.col('time'), F.col('temp'))).alias('timetemp'),
    F.collect_list(F.struct(F.col('time'), F.col('humid'))).alias('timehumidity')
).show(2, False)

Output:

+-------+------------------------------------+------------------------------------+
|city   |timetemp                            |timehumidity                        |
+-------+------------------------------------+------------------------------------+
|Dallas |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|NewYork|[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
+-------+------------------------------------+------------------------------------+

You can join it with your original dataframe.
If you want to have the results as actual tuples, you might need to write your own UDF; a sketch follows below.
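
A minimal sketch of such a UDF, assuming the grouped result from above is called agg_df (both agg_df and to_tuple_string are hypothetical names, not part of the original code). Since Spark has no native tuple type, the UDF renders each array of (time, value) structs as a Python-style string of tuples:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def to_tuple_string(pairs):
    # each element arrives as a Row of (time, value); render the list as tuple text
    return str([(p[0], p[1]) for p in pairs])

agg_df = agg_df.withColumn('timetemp', to_tuple_string('timetemp')) \
               .withColumn('timehumidity', to_tuple_string('timehumidity'))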


You can also define lists of columns to handle more column sets:

list_1 = ['time']
list_2 = ['temp', 'humid']  # change these accordingly

# one aggregated DataFrame per (x, y) column pair
df_array = [
    df.groupby('city').agg(F.collect_list(F.struct(F.col(x), F.col(y))).alias(x + y))
    for x in list_1 for y in list_2
]
for df_temp in df_array:
    df = df.join(df_temp, on='city', how='left')
df.show(truncate=False)

Output:

+-------+----+----+-----+------------------------------------+------------------------------------+
|city   |time|temp|humid|timetemp                            |timehumid                           |
+-------+----+----+-----+------------------------------------+------------------------------------+
|Dallas |1500|47  |37   |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|Dallas |1600|49  |35   |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|Dallas |1700|50  |39   |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|NewYork|1500|67  |57   |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
|NewYork|1600|69  |55   |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
|NewYork|1700|70  |56   |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
+-------+----+----+-----+------------------------------------+------------------------------------+

Upvotes: 2
