Reputation: 113
I am creating a DataFrame from an XML file using Spark in Python. I want to turn each comma-separated value in a row into its own column and encode it as a dummy (0/1) variable.
Here is an example.
Input:
id | classes |
-----+--------------------------+
132 | economics,engineering |
201 | engineering |
123 | sociology,philosophy |
222 | philosophy |
--------------------------------
Output:
id | economics | engineering | sociology | philosophy
-----+-----------+-------------+-----------+-----------
132 | 1 | 1 | 0 | 0
201 | 0 | 1 | 0 | 0
123 | 0 | 0 | 1 | 1
222 | 0 | 0 | 0 | 1
--------------------------------------------------------
Upvotes: 0
Views: 326
Reputation: 64
Split the column and explode it into multiple rows (ref: Explode in PySpark), then pivot the exploded values back into columns.
import pyspark.sql.functions as F
# Assumes an active SparkSession named `spark`
df = spark.createDataFrame(
    [(132, "economics,engineering"), (201, "engineering"),
     (123, "sociology,philosophy"), (222, "philosophy")],
    ["id", "classes"],
)
df.show()
+---+--------------------+
| id| classes|
+---+--------------------+
|132|economics,enginee...|
|201| engineering|
|123|sociology,philosophy|
|222| philosophy|
+---+--------------------+
# Split the comma-separated string and emit one row per value
explodeCol = df.select(F.col("id"), F.explode(F.split(F.col("classes"), ",")).alias("branch"))
explodeCol.show()
+---+-----------+
| id| branch|
+---+-----------+
|132| economics|
|132|engineering|
|201|engineering|
|123| sociology|
|123| philosophy|
|222| philosophy|
+---+-----------+
# Pivot the values into columns; ids without a given value get null, which is filled with 0
explodeCol.groupBy("id").pivot("branch").agg(F.sum(F.lit(1))).na.fill(0).show()
+---+---------+-----------+----------+---------+
| id|economics|engineering|philosophy|sociology|
+---+---------+-----------+----------+---------+
|222| 0| 0| 1| 0|
|201| 0| 1| 0| 0|
|132| 1| 1| 0| 0|
|123| 0| 0| 1| 1|
+---+---------+-----------+----------+---------+
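To sanity-check the expected dummy matrix without running Spark, the same split, explode, and pivot steps can be mirrored in plain Python. This is only an illustrative sketch: the helper name `to_dummies` is hypothetical and not part of any Spark API, and it assumes the values are separated by a plain comma.

```python
def to_dummies(rows, sep=","):
    """Mirror split -> explode -> pivot on (id, "a,b") pairs in plain Python."""
    # "Explode": one (id, category) pair per separated value.
    exploded = [(rid, cat.strip())
                for rid, classes in rows
                for cat in classes.split(sep)]
    # Sorted distinct categories become the output columns.
    columns = sorted({cat for _, cat in exploded})
    # "Pivot": count occurrences per id, defaulting to 0 (like na.fill(0)).
    table = {rid: dict.fromkeys(columns, 0) for rid, _ in rows}
    for rid, cat in exploded:
        table[rid][cat] += 1
    return columns, table


rows = [(132, "economics,engineering"), (201, "engineering"),
        (123, "sociology,philosophy"), (222, "philosophy")]
columns, table = to_dummies(rows)
print(columns)
for rid, counts in table.items():
    print(rid, [counts[c] for c in columns])
```

Each id maps to a row of 0/1 counts, one per distinct category, which should agree with the pivoted Spark output above.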
For more details, refer to the Spark documentation: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html
Upvotes: 3