Reputation: 399
How can I add the values from dataframe A to a new column (`sum`) in dataframe B that contains the given pairs of dataframe A? Preferably with a UDF?
The output should look like this:
dataframe A:
|id|value|
|--|-----|
|1 | 10|
|2 | 0.3|
|3 | 100|
dataframe B (with added column `sum`):
|src|dst|sum |
|---|---|-----|
|1 |2 |10.3 |
|2 |3 |100.3|
|3 |1 |110 |
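For reference, a minimal sketch that builds these example dataframes (assuming an active SparkSession named `spark`):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data from the tables above
dfA = spark.createDataFrame([(1, 10.0), (2, 0.3), (3, 100.0)], ["id", "value"])
dfB = spark.createDataFrame([(1, 2), (2, 3), (3, 1)], ["src", "dst"])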
I've tried this:

def sum(src, dst, dfA):
    return dfA.filter(dfA.id == src).collect()[0][1][0] + dfA.filter(dfA.id == dst).collect()[0][1][0]

dfB = dfB.withColumn('sum', sum(dfB.src, dfB.dst, dfA))
Upvotes: 0
Views: 1064
Reputation: 32640
Basically you need to join the two dataframes on the condition `(id = src OR id = dst)`, then group by `src` and `dst` to sum the `value` column:
from pyspark.sql import functions as F
output = df_a.join(
    df_b,
    (F.col("id") == F.col("src")) | (F.col("id") == F.col("dst"))
).groupBy("src", "dst").agg(
    F.sum("value").alias("sum")
)
output.show()
#+---+---+-----+
#|src|dst| sum|
#+---+---+-----+
#| 2| 3|100.3|
#| 1| 2| 10.3|
#| 3| 1|110.0|
#+---+---+-----+
Upvotes: 1
Reputation: 86
If dfA is small enough for a broadcast join, then this should work:
from pyspark.sql import functions as F

dfB.join(dfA, how="left", on=F.col("src") == F.col("id")).select(
    "src", "dst", F.coalesce(F.col("value"), F.lit(0)).alias("v1")
).join(dfA, how="left", on=F.col("dst") == F.col("id")).select(
    "src", "dst", (F.col("v1") + F.coalesce(F.col("value"), F.lit(0))).alias("sum")
)
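If you'd rather make the broadcast explicit than rely on Spark's `spark.sql.autoBroadcastJoinThreshold`, you can wrap dfA in a broadcast hint; a minimal sketch of the first join:

# Explicitly mark dfA for broadcasting so each executor gets a full copy
# and the join avoids shuffling the larger dfB.
dfB.join(F.broadcast(dfA), how="left", on=F.col("src") == F.col("id"))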
You can remove `.coalesce()` from these joins if the id column contains every src and dst value. There are a few ways to make this reusable, but your best bet may be using `.transform()`:
def join_sum(join_df):
    def _(df):
        return (
            df.join(join_df, how="left", on=F.col("src") == F.col("id"))
            .select("src", "dst", F.coalesce(F.col("value"), F.lit(0)).alias("v1"))
            .join(join_df, how="left", on=F.col("dst") == F.col("id"))
            .select(
                "src",
                "dst",
                (F.col("v1") + F.coalesce(F.col("value"), F.lit(0))).alias("sum"),
            )
        )
    return _
dfB.transform(join_sum(dfA))
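Calling `.show()` on the result should reproduce the expected table from the question (row order may vary):

dfB.transform(join_sum(dfA)).show()
#+---+---+-----+
#|src|dst|  sum|
#+---+---+-----+
#|  1|  2| 10.3|
#|  2|  3|100.3|
#|  3|  1|110.0|
#+---+---+-----+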
Upvotes: 1