Reputation: 369
I have the following pyspark dataframe
Car | Time | Val1 | Val2 | Val3 |
---|---|---|---|---|
1 | 1 | None | 1.5 | None |
1 | 1 | 3.5 | None | None |
1 | 1 | None | None | 3.4 |
1 | 2 | 2.5 | None | None |
1 | 2 | None | 6.0 | None |
1 | 2 | None | None | 7.3 |
I want to fill in the gaps and combine these rows, using the Car/Time columns as a key of sorts. Specifically, if two (or more) rows have identical values in the Car/Time columns, they should be combined into one row. It is guaranteed that only one of Val1/Val2/Val3 is filled out per duplicate row: there will never be a case where two rows share the same Car/Time values but have different, non-None values in the same column. The resulting dataframe should therefore look like this.
Car | Time | Val1 | Val2 | Val3 |
---|---|---|---|---|
1 | 1 | 3.5 | 1.5 | 3.4 |
1 | 2 | 2.5 | 6.0 | 7.3 |
Thanks in advance for your help
Upvotes: 0
Views: 104
Reputation: 3733
You can use a group by with the aggregate function `first`, with the flag `ignorenulls` set to true:
import pyspark.sql.functions as F
data = [
{"Car": 1, "Time": 1, "Val1": None, "Val2": 1.5, "Val3": None},
{"Car": 1, "Time": 1, "Val1": 3.5, "Val2": None, "Val3": None},
{"Car": 1, "Time": 1, "Val1": None, "Val2": None, "Val3": 3.4},
{"Car": 1, "Time": 2, "Val1": 2.5, "Val2": None, "Val3": None},
{"Car": 1, "Time": 2, "Val1": None, "Val2": 6.0, "Val3": None},
{"Car": 1, "Time": 2, "Val1": None, "Val2": None, "Val3": 7.3},
{"Car": 2, "Time": 3, "Val1": None, "Val2": None, "Val3": 9.2},
]
df = spark.createDataFrame(data)
df.groupBy("Car", "Time").agg(
    F.first("Val1", ignorenulls=True).alias("Val1"),
    F.first("Val2", ignorenulls=True).alias("Val2"),
    F.first("Val3", ignorenulls=True).alias("Val3"),
).show()
I added one extra row to check how it behaves with a group that has only a single entry; it works fine there too.
The output is:
+---+----+----+----+----+
|Car|Time|Val1|Val2|Val3|
+---+----+----+----+----+
| 1| 1| 3.5| 1.5| 3.4|
| 1| 2| 2.5| 6.0| 7.3|
| 2| 3|null|null| 9.2|
+---+----+----+----+----+
Upvotes: 2