Reputation: 366
I have the following Dataframe
from pyspark.sql.types import StructType, StructField, StringType

REC_DATA = spark.createDataFrame(
    [
        ('exercise', 'fiber', 'rice', 'male'),
        ('exercise', 'rice', 'fiber', 'female'),
        ('exercise', 'water', 'fiber', 'male'),
        ('water', 'rice', 'exercise', 'female'),
    ],
    StructType(
        [
            StructField("1_rec", StringType(), False),
            StructField("2_rec", StringType(), False),
            StructField("3_rec", StringType(), False),
            StructField("sex", StringType(), True),
        ]
    ),
)
1_rec | 2_rec | 3_rec | sex |
---|---|---|---|
exercise | fiber | rice | male |
exercise | rice | fiber | female |
exercise | water | fiber | male |
water | rice | exercise | female |
I'm trying to unpivot the columns 1_rec, 2_rec, and 3_rec into rows, adding a column that counts how many times each value appears in each position. The output should look like this:
Position | name | count |
---|---|---|
1_rec | exercise | 3 |
1_rec | water | 1 |
2_rec | water | 1 |
2_rec | rice | 2 |
2_rec | fiber | 1 |
3_rec | rice | 1 |
3_rec | fiber | 2 |
3_rec | exercise | 1 |
I tried using crosstab, but it didn't work properly.
Upvotes: 0
Views: 219
Reputation: 2939
Use stack
to unpivot the columns, then group by position and name:
import pyspark.sql.functions as F
REC_DATA = (REC_DATA
    # column names starting with a digit must be backtick-quoted in SQL expressions
    .selectExpr("stack(3, '1_rec', `1_rec`, '2_rec', `2_rec`, '3_rec', `3_rec`) as (position, name)")
    .groupBy('position', 'name')
    .agg(F.count('*').alias('count')))
REC_DATA.show()
+--------+--------+-----+
|position| name|count|
+--------+--------+-----+
| 1_rec| water| 1|
| 2_rec| rice| 2|
| 3_rec|exercise| 1|
| 3_rec| fiber| 2|
| 2_rec| water| 1|
| 3_rec| rice| 1|
| 1_rec|exercise| 3|
| 2_rec| fiber| 1|
+--------+--------+-----+
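For anyone without a Spark session at hand, here is a minimal plain-Python sketch of what the stack-then-groupBy pipeline computes, using `collections.Counter` on the question's rows (the rows and position names mirror the example data above):

```python
from collections import Counter

# Rows from the question's DataFrame: (1_rec, 2_rec, 3_rec, sex)
rows = [
    ('exercise', 'fiber', 'rice', 'male'),
    ('exercise', 'rice', 'fiber', 'female'),
    ('exercise', 'water', 'fiber', 'male'),
    ('water', 'rice', 'exercise', 'female'),
]
positions = ['1_rec', '2_rec', '3_rec']

# "stack" step: emit one (position, name) pair per recommendation column;
# Counter then plays the role of groupBy('position', 'name').count()
counts = Counter(
    (pos, row[i]) for row in rows for i, pos in enumerate(positions)
)

print(counts[('1_rec', 'exercise')])  # 3
print(counts[('3_rec', 'fiber')])     # 2
```

This reproduces the eight (position, name, count) rows in the answer's output.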
Upvotes: 3