Reputation: 129
I'm new to Spark and I'm trying to make sense of how it works. I'm trying to create a table (DataFrame) whose columns are based on the JSON file I read. What I've done is:
val example = mysession.read.json("/FileStore/tables/bm2982090/exampleLog_dp_api-fac53.json")
As the documentation says, this returns a DataFrame. Now the DataFrame looks like this:
+--------------------+-----------------+------+--------------------+-----+
|                 _id|           _index|_score|             _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22|     1|[2017-06-22T08:40...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
and as you can see, it does a good job creating the columns _id, _index,
etc., BUT when it comes to _source
it puts everything inside a single column. This happens because the file I've read looks like this:
{
  "_index":"dp_api-2017.06.22",
  "_type":"DPAPI",
  "_id":"AVzO9dqvoaL5S78GvkQU",
  "_score":1,
  "_source":{
    "request":{
      "user_ip":"ip",
      "method":"POST",
      "user_agent":"agent",
      "ts":"2017-06-22T10:40:37.420651+02:00"
    },
    "svc":"example",
    "cipher_info":{
      "userdata":[
        "field:parameters.username,start:0,end:0"
      ]
    },
    "message":"{\"class\":\"DPAPI\",\"event\":\"druid.auth.login\",\"operation\":{\"result\":{\"code\":200,\"description\":\"Success\",\"http_status\":200}},\"parameters\":,\"request\":{\"app_instance\":\"e83f99ff-a768-44d2-a448-9b51a535183f-1498034708\",\"end_point\":\"/oken\",\"method\":\"POST\",\"ts\":\"2017-06-22T10:40:37.420651+02:00\",\"user_agent\":\"Dr.1/Sa-SM-N\"},\"svc\":\"c-1\",\"ts\":\"2017-06-22T10:40:37.420614+02:00\"}",
    "type":"DPAPI",
    "tags":[
      "parse_to_json"
    ],
    "index_name":"dp_api",
    "class":"DPAPI",
    "operation":{
      "result":{
        "code":200,
        "description":"Success",
        "http_status":200
      }
    },
    "parameters":{
      "username":"Xp3opAyI0udKPuSQq5gqkQ=="
    },
    "event_age":0.37999987602233887,
    "ts":"2017-06-22T10:40:37.420614+02:00"
  }
}
So what I think is that when it comes to nested values the process is "lazy" and it doesn't create the columns correctly. My question is: is it possible to extract the content of _source
before creating the DataFrame? Doing it after the DataFrame is created is fine too, but then the output should be a table which has as many columns as there are keys.
For example :
+--------------------+-----------------+------+-----------------------+-----+
|                 _id|           _index|_score|_source.request.user_ip|_type|
+--------------------+-----------------+------+-----------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22|     1|                     ip|DPAPI|
+--------------------+-----------------+------+-----------------------+-----+
Hope I explained it well! Thanks.
Upvotes: 0
Views: 286
Reputation: 41957
I guess extracting after you create the DataFrame should be more efficient and faster than extracting before you create it. You have used the built-in functions and APIs for creating a DataFrame from JSON; if you extract before that, you will have to write some custom parsing code yourself, which won't be as efficient as the provided APIs.
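If you want to see exactly which nested paths Spark inferred from the JSON (and hence which dotted column names you can select), printing the schema is the quickest check; printSchema is a standard DataFrame method:
// Print the inferred schema, including the nested structs under _source,
// e.g. _source.request.user_ip, so you know which paths are selectable.
example.printSchema()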
You can just select the columns you require after the DataFrame is created, as below:
example.select("_id", "_index", "_score", "_source.request.user_ip", "_type").show(false)
which should give you the DataFrame as you require:
+--------------------+-----------------+------+-------+-----+
|_id |_index |_score|user_ip|_type|
+--------------------+-----------------+------+-------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22|1 |ip |DPAPI|
+--------------------+-----------------+------+-------+-----+
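If you would rather promote every nested key to its own column instead of listing the paths by hand, one common approach is to walk the DataFrame's schema and build the select list recursively. This is only a minimal sketch; flattenSchema is a helper written here for illustration, not a built-in Spark function:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively expand nested structs into a flat list of columns, aliasing each
// leaf with its full dotted path (e.g. _source.request.user_ip) so the headers
// stay unique. Arrays are left as-is.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
  schema.fields.toSeq.flatMap {
    case StructField(name, inner: StructType, _, _) =>
      flattenSchema(inner, s"$prefix$name.")
    case StructField(name, _, _, _) =>
      Seq(col(s"$prefix$name").alias(s"$prefix$name"))
  }

example.select(flattenSchema(example.schema): _*).show(false)
Bear in mind that column names containing dots must be wrapped in backticks in later select calls, so you may prefer to join the path segments with underscores in the alias instead.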
I hope I answered your query well.
Upvotes: 2