Reputation: 11
I have data in the following format, and I want to change its format using pyspark with two columns ('tag' and 'data'). The 'tag' column values are unique, and the 'data' column values are a json string obtained from the orginial column 'date、stock、price' in which combine 'stock' and 'price' to be the 'A' columns value, combine 'date' and 'num' to be the 'B' columns value.
I didn't find or write good funcitions to realize this effect.
my spark version is 2.1.0
original DataFrame
date, stock, price, tag, num
1388534400, GOOG, 50, a, 1
1388534400, FB, 60, b, 2
1388534400, MSFT, 55, c, 3
1388620800, GOOG, 52, d, 4
I expect the output:
new DataFrame
tag| data
'a'| "{'A':{'stock':'GOOD', 'price': 50}, B:{'date':1388534400, 'num':1}"
'b'| "{'A':{'stock':'FB', 'price': 60}, B:{'date':1388534400, 'num':2}"
'c'| "{'A':{'stock':'MSFT', 'price': 55}, B:{'date':1388534400, 'num':3}"
'd'| "{'A':{'stock':'GOOG', 'price': 52}, B:{'date':1388620800, 'num':4}"
+--+--------------------------------------------------------------+
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([
(1388534400, "GOOG", 50, 'a', 1),
(1388534400, "FB", 60, 'b', 2),
(1388534400, "MSFT", 55, 'c', 3),
(1388620800, "GOOG", 52, 'd', 4)]
).toDF("date", "stock", "price", 'tag', 'num')
df.show()
tag_cols = {'A':['stock', 'price'], 'B':['date', 'num']}
# todo, change the Dataframe columns format
Upvotes: 1
Views: 794
Reputation: 13998
IIUC, just use pyspark.sql.functions.struct and pyspark.sql.functions.to_json (both should be available in spark 2.1)
from pyspark.sql import functions as F
# skip df initialization[enter link description here][1]
df_new = df.withColumn('A', F.struct('stock', 'price')) \
.withColumn('B', F.struct('date', 'num')) \
.select('tag', F.to_json(F.struct('A', 'B')).alias('data'))
>>> df_new.show(5,0)
+---+-----------------------------------------------------------------+
|tag|data |
+---+-----------------------------------------------------------------+
|a |{"A":{"stock":"GOOG","price":50},"B":{"date":1388534400,"num":1}}|
|b |{"A":{"stock":"FB","price":60},"B":{"date":1388534400,"num":2}} |
|c |{"A":{"stock":"MSFT","price":55},"B":{"date":1388534400,"num":3}}|
|d |{"A":{"stock":"GOOG","price":52},"B":{"date":1388620800,"num":4}}|
+---+-----------------------------------------------------------------+
Upvotes: 1