Reputation: 1935
Spark 2.2.1 Pyspark
df = sqlContext.createDataFrame([
("dog", "1", "2", "3"),
("cat", "4", "5", "6"),
("dog", "7", "8", "9"),
("cat", "10", "11", "12"),
("dog", "13", "14", "15"),
("parrot", "16", "17", "18"),
("goldfish", "19", "20", "21"),
], ["pet", "dog_30", "cat_30", "parrot_30"])
And then I have a list of the fields that I care about from the "pet" column:
dfvalues = ["dog", "cat", "parrot"]
I want to write code that will give me the value from dog_30, cat_30, or parrot_30 that corresponds to the value in "pet". For example, in the first row the value for the pet column is dog, so we take the value for dog_30, which is 1.
I tried the code below, but it just gives me nulls for the stats column. I also haven't figured out how to handle the goldfish case; I want to set that to 0.
mycols = [F.when(F.col("pet") == p + "_30", p) for p in dfvalues]
df = df.withColumn("newCol2",F.coalesce(*stats) )
df.show()
Desired output:
+--------+------+------+---------+------+
| pet|dog_30|cat_30|parrot_30|stats |
+--------+------+------+---------+------+
| dog| 1| 2| 3| 1 |
| cat| 4| 5| 6| 5 |
| dog| 7| 8| 9| 7 |
| cat| 10| 11| 12| 11 |
| dog| 13| 14| 15| 13 |
| parrot| 16| 17| 18| 18 |
|goldfish| 19| 20| 21| 0 |
+--------+------+------+---------+------+
Upvotes: 0
Views: 101
Reputation: 214927
The logic is off: you're comparing pet against the column names (p + "_30"), so no condition ever matches and every when returns null. You need .when(F.col("pet") == p, F.col(p + '_30')) instead, with a final F.lit(0) in the coalesce to cover the goldfish case:
mycols = [F.when(F.col("pet") == p, F.col(p + '_30')) for p in dfvalues]
df = df.withColumn("newCol2",F.coalesce(F.coalesce(*mycols),F.lit(0)))
df.show()
+--------+------+------+---------+-------+
| pet|dog_30|cat_30|parrot_30|newCol2|
+--------+------+------+---------+-------+
| dog| 1| 2| 3| 1|
| cat| 4| 5| 6| 5|
| dog| 7| 8| 9| 7|
| cat| 10| 11| 12| 11|
| dog| 13| 14| 15| 13|
| parrot| 16| 17| 18| 18|
|goldfish| 19| 20| 21| 0|
+--------+------+------+---------+-------+
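For what it's worth, the same result can also be built with a single chained when/otherwise instead of coalesce. This is just a sketch that assumes the df and dfvalues from the question; the output column is named stats here only to match the desired output, use whatever name you prefer:
from functools import reduce
from pyspark.sql import functions as F

# Fold dfvalues into one nested when/otherwise expression;
# F.lit(0) is the fallback for pets (e.g. goldfish) with no matching *_30 column.
stats_expr = reduce(
    lambda acc, p: F.when(F.col("pet") == p, F.col(p + "_30")).otherwise(acc),
    dfvalues,
    F.lit(0),
)

df = df.withColumn("stats", stats_expr)
df.show()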
Upvotes: 3