Pfinnn

Reputation: 572

Interpolation between two Dataframes (years with values) in Pyspark

How can I implement linear interpolation between two PySpark DataFrames representing data for different years, say 2020 and 2030, to generate a new PySpark DataFrame for an intermediary year like 2025? Both DataFrames have identical schemas with numeric values, and both years have the same granularity.

My initial approach was to use pyspark.pandas.DataFrame.interpolate: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.interpolate.html
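For reference, this is roughly how I picture using it, assuming I first stack the values into one frame with a row per year (psdf and the column names are made up):

import numpy as np
import pyspark.pandas as ps

# Hypothetical frame: one row per year, values known only for 2020 and 2030.
psdf = ps.DataFrame(
    {
        "value1": [10.0] + [np.nan] * 9 + [40.0],
        "value2": [20.0] + [np.nan] * 9 + [50.0],
    },
    index=np.arange(2020, 2031),
)

# Only method="linear" is supported (Spark 3.4+), and rows are treated as
# evenly spaced, so the frame needs exactly one row per year.
filled = psdf.interpolate()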

Is this the recommended way?

I wrote this Pandas method a while back, but now I need to migrate to PySpark, and I am struggling to implement the same logic there.

import numpy as np
import pandas as pd
from pandas import DataFrame


def interpolate_between_years(first: DataFrame, second: DataFrame) -> DataFrame:
    # Read each frame's year off its DatetimeIndex.
    years = [first.index.year[0], second.index.year[0]]
    interpolated_df = (
        # Stack the two frames side by side, keyed by year.
        pd.concat(
            [first.reset_index(drop=True), second.reset_index(drop=True)],
            keys=years,
            axis=1,
        )
        # Transpose so years run down the index, add a row for every year
        # in between, and fill the gaps by linear interpolation.
        .T.reindex(np.arange(years[0], years[1] + 1))
        .interpolate()
    )
    return interpolated_df
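The closest I have come in PySpark is a round-trip through the pandas API on Spark, which I assume only works while both frames are small enough to collect to the driver (interpolate_between_years_ps is just a sketch of mine):

import pyspark.pandas as ps

# Sketch only: collect both frames to the driver, reuse the pandas method
# above unchanged, then hand the result back to Spark.
def interpolate_between_years_ps(first: ps.DataFrame, second: ps.DataFrame) -> ps.DataFrame:
    interpolated = interpolate_between_years(first.to_pandas(), second.to_pandas())
    return ps.from_pandas(interpolated)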

Upvotes: 1

Views: 137

Answers (1)

Vikas Sharma

Reputation: 2149

You may use the PySpark DataFrame API as follows to do linear interpolation:

from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col


def interpolate_between_years(first: DataFrame, second: DataFrame) -> DataFrame:
    spark = SparkSession.builder.getOrCreate()

    # Anchor the interpolation on the first row of each frame.
    year_first = first.select("year").first()[0]
    year_second = second.select("year").first()[0]

    # Build one row per year between (and including) the two anchor years.
    years_range = range(year_first, year_second + 1)
    interpolated_df = spark.createDataFrame([(year,) for year in years_range], ["year"])

    # Linearly interpolate every value column between its two anchor values.
    for col_name in first.columns:
        if col_name != "year":
            first_val = first.select(col_name).first()[0]
            second_val = second.select(col_name).first()[0]
            diff = second_val - first_val
            factor = (col("year") - year_first) / (year_second - year_first)
            interpolated_df = interpolated_df.withColumn(col_name, first_val + diff * factor)

    return interpolated_df

spark = SparkSession.builder.getOrCreate()

data_2020 = [(2020, 10, 20), (2021, 11, 21), (2022, 12, 22)]
df_2020 = spark.createDataFrame(data_2020, ["year", "value1", "value2"])

data_2030 = [(2030, 40, 50), (2031, 41, 51), (2032, 42, 52)]
df_2030 = spark.createDataFrame(data_2030, ["year", "value1", "value2"])

interpolated_df = interpolate_between_years(df_2020, df_2030)
interpolated_df.show()

Output:

+----+------+------+
|year|value1|value2|
+----+------+------+
|2020|  10.0|  20.0|
|2021|  13.0|  23.0|
|2022|  16.0|  26.0|
|2023|  19.0|  29.0|
|2024|  22.0|  32.0|
|2025|  25.0|  35.0|
|2026|  28.0|  38.0|
|2027|  31.0|  41.0|
|2028|  34.0|  44.0|
|2029|  37.0|  47.0|
|2030|  40.0|  50.0|
+----+------+------+

From this interpolated_df, you can fetch the data for any intermediary year, such as 2025 or 2026.
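For example, filtering on year pulls out a single interpolated row (col is the same pyspark.sql.functions import used above):

interpolated_df.filter(col("year") == 2025).show()

# +----+------+------+
# |year|value1|value2|
# +----+------+------+
# |2025|  25.0|  35.0|
# +----+------+------+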

Upvotes: 0
