Reputation: 23
I have a .log file in ADLS which contains multiple nested JSON objects, as follows:
{"EventType":3735091736,"Timestamp":"2019-03-19","Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},"Id":"event-c20b9c7eac0808d6321106d901000000"}
{"EventType":3735091737,"Timestamp":"2019-03-18","Data":{"Id":"event-d2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"f69b7489","ActionId":"d0f2c3fd"}},"Id":"event-d20b9c7eac0808d6321106d901000000"}
{"EventType":3735091738,"Timestamp":"2019-03-17","Data":{"Id":"event-e2","Level":1,"MessageTemplate":"Test1","Properties":{"CorrId":"g69b7489","ActionId":"d0d2c3fd"}},"Id":"event-e20b9c7eac0808d6321106d901000000"}
I need to read these nested JSON objects in PySpark and convert them to a DataFrame as follows:
EventType    Timestamp   Data.Id   .....  Data.Properties.CorrId  Data.Properties.ActionId
3735091736   2019-03-19  event-c2  .....  d69b7489                d0e2c3fd
3735091737   2019-03-18  event-d2  .....  f69b7489                d0f2c3fd
3735091738   2019-03-17  event-e2  .....  g69b7489                d0d2c3fd
For this I am using ADLS and PySpark in Azure Databricks.
Does anyone know a general way to deal with this problem? Thanks!
Upvotes: 1
Views: 2103
Reputation: 23
I was able to read the data with the following code:
from pyspark.sql.functions import col

DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id'), col('EventType'), col('Timestamp'),
          col('Data.Id'), col('Data.Level'), col('Data.MessageTemplate'),
          col('Data.Properties.CorrId'), col('Data.Properties.ActionId'))\
  .show()
***Result***
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
| Id| EventType| Timestamp| Id|Level|MessageTemplate| CorrId|ActionId|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|d69b7489|d0e2c3fd|
|event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|f69b7489|d0f2c3fd|
|event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|g69b7489|d0d2c3fd|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
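As a side note, if you would rather not list every nested field by hand, selecting Data.* and Data.Properties.* expands all fields of a struct at once. This is only a sketch on top of the same file; the DataId alias is just a name I chose to avoid ending up with two Id columns:

from pyspark.sql.functions import col

df = spark.read.json("demo_files/Test20191023.log")
# Expand the nested structs in one pass; alias the inner Id so it does
# not clash with the top-level Id column.
flat = df.select(col('Id'), col('EventType'), col('Timestamp'),
                 col('Data.Id').alias('DataId'),
                 col('Data.Level'), col('Data.MessageTemplate'),
                 'Data.Properties.*')  # expands CorrId and ActionId
flat.show()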
Upvotes: 0
Reputation: 7399
You can read the file as an RDD of text lines, parse each line with json.loads(), and turn the result into a DataFrame with toDF(). Then explode the Data column into multiple columns, given your Id column is going to be unique. Note that explode returns key, value columns for each entry in the map type; the same approach can be applied to the Properties column if needed. Solution:
import json
from pyspark.sql import functions as F

# Parse each line into a Python dict and let Spark infer the schema
rdd = sc.textFile("demo_files/Test20191023.log")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
# +--------------------+----------+--------------------+----------+
# | Data| EventType| Id| Timestamp|
# +--------------------+----------+--------------------+----------+
# |[MessageTemplate ...|3735091736|event-c20b9c7eac0...|2019-03-19|
# |[MessageTemplate ...|3735091737|event-d20b9c7eac0...|2019-03-18|
# |[MessageTemplate ...|3735091738|event-e20b9c7eac0...|2019-03-17|
# +--------------------+----------+--------------------+----------+
data_exploded = df.select('Id', 'EventType', "Timestamp", F.explode('Data'))\
.groupBy('Id', 'EventType', "Timestamp").pivot('key').agg(F.first('value'))
# There is a duplicate Id column, which might cause ambiguity problems
data_exploded.show()
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# | Id| EventType| Timestamp| Id|Level|MessageTemplate| Properties|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|{CorrId=d69b7489,...|
# |event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|{CorrId=f69b7489,...|
# |event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|{CorrId=g69b7489,...|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
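In this pivoted result the Properties column is still a single nested value. If you also need CorrId and ActionId as their own columns, one option (not part of the answer above, just a sketch based on the sample records) is to flatten each record in plain Python before calling toDF(), so nothing nested reaches Spark at all:

import json
from pyspark.sql import Row

def flatten(line):
    rec = json.loads(line)
    data = rec.get('Data', {})
    props = data.get('Properties', {})
    # One flat Row per record; the column names here are illustrative
    # choices (DataId avoids clashing with the outer Id).
    return Row(Id=rec.get('Id'),
               EventType=rec.get('EventType'),
               Timestamp=rec.get('Timestamp'),
               DataId=data.get('Id'),
               Level=data.get('Level'),
               MessageTemplate=data.get('MessageTemplate'),
               CorrId=props.get('CorrId'),
               ActionId=props.get('ActionId'))

flat_df = sc.textFile("demo_files/Test20191023.log").map(flatten).toDF()
flat_df.show()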
Upvotes: 1