Reputation: 119
I'd like to create a new column (v5) based on the existing subset of columns in the dataframe.
Sample dataframe:
+---+---+---+---+
| v1| v2| v3| v4|
+---+---+---+---+
| 2| 4|7.0|4.0|
| 99| 0|2.0|0.0|
|189| 0|2.4|0.0|
+---+---+---+---+
providing another view of the sample dataframe:
+---+---+---+---+
| v1| v3| v2| v4|
+---+---+---+---+
| 2|7.0| 4|4.0|
| 99|2.0| 0|0.0|
|189|2.4| 0|0.0|
+---+---+---+---+
which is created by:
rdd1 = sc.parallelize([
(2, 7.0, 4, 4.0),
(99, 2.0, 0, 0.0),
(189, 2.4, 0, 0.0)])
d = sqlContext.createDataFrame(rdd1, ('v1', 'v3','v2','v4'))
Ultimately, what I'd like to do is create another column v5 which is the value corresponding to minimum value of v1 and v2 ignoring the zeros and nulls present in either of the column. Assume v1 as key and v3 is value pair.Similarly v2 is the key and v4 is the value. For example, in the first row: Amongst v1 and v2, the least value belongs to v1 i.e., 2 so the output in v5 column should be 7.0 Likewise, in second row: ignoring zero and null values of v1 & v2, the output should be 2.0
The original dataframe has five columns as keys and respective five columns as values Desired output:
+---+---+---+---+---+
| v1| v2| v3| v4| v5|
+---+---+---+---+---+
| 2| 4|7.0|4.0|7.0|
| 99| 0|2.0|0.0|2.0|
|189| 0|2.4|0.0|2.4|
+---+---+---+---+---+
I tried to accomplish this via least function in a udf, couldn't get that to work. I'm using PySpark 1.6. Any help is much appreciated.
Upvotes: 1
Views: 7785
Reputation: 35219
With data:
df = spark.createDataFrame([
(2, 4, 3.0, .0), (99, 0, 2.0, 0.0), (189, 0, 2.4, 0.0)],
("v1", "v2", "v3", "v4")
)
you can just substitute NULL
/ 0
with with -Inf
or +Inf
.
from pyspark.sql.functions import col, lit, least, greatest, when
cols = ["v3", "v4"]
min_ = least(*[
when(col(c).isNull() | (col(c) == 0), float("inf")).otherwise(col(c))
for c in cols
]).alias("min")
max_ = greatest(*[
when(col(c).isNull() | (col(c) == 0), float("-inf")).otherwise(col(c))
for c in cols
]).alias("max")
And select:
df.select("*", min_, max_).show()
# +---+---+---+---+---+---+
# | v1| v2| v3| v4|min|max|
# +---+---+---+---+---+---+
# | 2| 4|3.0|7.0|3.0|7.0|
# | 99| 0|2.0|0.0|2.0|2.0|
# |189| 0|2.4|0.0|2.4|2.4|
# +---+---+---+---+---+---+
Upvotes: 5