Angelo

Reputation: 655

Pivot a PySpark DataFrame to get a MultiColumn

My PySpark DataFrame looks like this:

+--------+----------+----+----+----+
|latitude| longitude|var1|date|var2|
+--------+----------+----+----+----+
|    3.45|     -8.65|   1|   7|   2|
|   30.45|     45.65|   1|   7|   2|
|   40.45|    123.65|   1|   7|   2|
|   43.45|     13.65|   1|   7|   2|
|   44.45|    -12.65|   1|   7|   2|
|   54.45|   -128.65|   1|   7|   2|
+--------+----------+----+----+----+

but I don't know how to reshape it so that there is a single row per date, with a MultiColumn specifying [variable, latitude, longitude] in that order, so that each combination of variable, latitude, and longitude can be treated as a separate column.

Making this:

import pyspark.sql.functions as F

# the non-key columns to pivot
var_cols = ['var1', 'var2']

df.select(
    'date',
    *[F.array(F.col(col), F.col('latitude'), F.col('longitude')) for col in var_cols]
).show()

I get:

+----+---------------------------------+---------------------------------+
|date|array(var1, latitude, longitude) |array(var2, latitude, longitude) |
+----+---------------------------------+---------------------------------+
|   7|               [1.0, 3.45, -8.65]|               [2.0, 3.45, -8.65]|
|   7|              [1.0, 30.45, 45.65]|              [2.0, 30.45, 45.65]|
|   7|             [1.0, 40.45, 123.65]|             [2.0, 40.45, 123.65]|
|   7|              [1.0, 43.45, 13.65]|              [2.0, 43.45, 13.65]|
|   7|             [1.0, 44.45, -12.65]|             [2.0, 44.45, -12.65]|
|   7|             [1.0, 54.45, -128...|             [2.0, 54.45, -128...|
+----+---------------------------------+---------------------------------+

And I would like a column for EACH VALUE of latitude and longitude, each holding a single value (the value of the var). Imagine setting an index of [date, latitude, longitude] in pandas and then unstacking the latitude and longitude levels.

For example, in pandas I would do this:

df.set_index(["date", "latitude", "longitude"]).unstack().unstack()
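
To make the target shape concrete, here is a minimal runnable version of that pandas snippet; the two-row `DataFrame` below is an abbreviation of the sample table in the question:

```python
import pandas as pd

# Two of the sample rows from the question, for brevity
df = pd.DataFrame({
    "latitude": [3.45, 30.45],
    "longitude": [-8.65, 45.65],
    "var1": [1, 1],
    "date": [7, 7],
    "var2": [2, 2],
})

# Index on [date, latitude, longitude], then unstack longitude and latitude
# into column levels: one row per date, columns keyed by (var, longitude, latitude)
wide = df.set_index(["date", "latitude", "longitude"]).unstack().unstack()
```

With two distinct (latitude, longitude) pairs and two vars this yields a single row (date=7) and a 2×2×2 column MultiIndex; (latitude, longitude) combinations that never occur in the data come out as NaN.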

Upvotes: 1

Views: 103

Answers (2)

mck

Reputation: 42422

How about this:

import pyspark.sql.functions as F

var_cols = [col for col in df.columns if col not in ['date', 'latitude', 'longitude']]

df.withColumn('latlong',
              F.concat_ws('_', F.col('latitude'), F.col('longitude'))) \
  .groupBy('date') \
  .pivot('latlong') \
  .agg(*[F.first(col) for col in var_cols])
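
For comparison, the same build-a-key-then-pivot idea can be sketched in pandas (a hypothetical mirror of the Spark code above, with a `latlong` column playing the role of the `concat_ws` key; two sample rows for brevity):

```python
import pandas as pd

# Two of the sample rows from the question
df = pd.DataFrame({
    "latitude": [3.45, 30.45],
    "longitude": [-8.65, 45.65],
    "var1": [1, 1],
    "date": [7, 7],
    "var2": [2, 2],
})

# Build the "lat_long" key, mirroring F.concat_ws('_', latitude, longitude)
df["latlong"] = df["latitude"].astype(str) + "_" + df["longitude"].astype(str)

# Pivot on the key; "first" mirrors F.first, since each (date, latlong)
# pair holds a single value per variable
wide = df.pivot_table(index="date", columns="latlong",
                      values=["var1", "var2"], aggfunc="first")
```

The result has one row per date and a (variable, latlong) column MultiIndex, which is the shape the question asks for.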

Upvotes: 1

Angelo

Reputation: 655

I came across this solution:

import pyspark.sql.functions as F

var_cols = [col for col in df.columns if col not in ['date', 'latitude', 'longitude']]

df = df.withColumn('latlong',F.array(F.col('latitude'), F.col('longitude')))

df = df.withColumn('latlong', F.concat_ws(',', 'latlong'))
df = df.groupBy(["date"]).pivot("latlong").max(*var_cols)

Upvotes: 1
