ennon

Reputation: 129

How to extract data after mapped DataFrame in scala

I'm new to Spark and I'm trying to make sense of how it works. I'm trying to create a table (DataFrame) whose columns are based on the JSON file I read. What I've done is:

val example = mysession.read.json("/FileStore/tables/bm2982090/exampleLog_dp_api-fac53.json")

As the documentation says, this returns a DataFrame. The DataFrame looks like this:

+--------------------+-----------------+------+--------------------+-----+
|                 _id|           _index|_score|             _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22|     1|[2017-06-22T08:40...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+

As you can see, it does a good job creating the columns _id, _index, etc., BUT when it comes to _source it puts everything inside a single column. This happens because the file I've read looks like this:

{
  "_index":"dp_api-2017.06.22",
  "_type":"DPAPI",
  "_id":"AVzO9dqvoaL5S78GvkQU",
  "_score":1,
  "_source":{
    "request":{
      "user_ip":"ip",
      "method":"POST",
      "user_agent":"agent",
      "ts":"2017-06-22T10:40:37.420651+02:00"
    },
    "svc":"example",
    "cipher_info":{
      "userdata":[
        "field:parameters.username,start:0,end:0"
      ]
    },
    "message":"{\"class\":\"DPAPI\",\"event\":\"druid.auth.login\",\"operation\":{\"result\":{\"code\":200,\"description\":\"Success\",\"http_status\":200}},\"parameters\":,\"request\":{\"app_instance\":\"e83f99ff-a768-44d2-a448-9b51a535183f-1498034708\",\"end_point\":\"/oken\",\"method\":\"POST\",\"ts\":\"2017-06-22T10:40:37.420651+02:00\",\"user_agent\":\"Dr.1/Sa-SM-N\"},\"svc\":\"c-1\",\"ts\":\"2017-06-22T10:40:37.420614+02:00\"}",
    "type":"DPAPI",
    "tags":[
      "parse_to_json"
    ],
    "index_name":"dp_api",
    "class":"DPAPI",
    "operation":{
      "result":{
        "code":200,
        "description":"Success",
        "http_status":200
      }
    },
    "parameters":{
      "username":"Xp3opAyI0udKPuSQq5gqkQ=="
    },
    "event_age":0.37999987602233887,
    "ts":"2017-06-22T10:40:37.420614+02:00"
  }
}
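To see exactly how Spark inferred this nesting, it can help to print the schema before selecting anything. Below is a small, self-contained sketch (my assumption: a local SparkSession and a one-line in-memory copy of the document, trimmed down, instead of the actual file from the question):

```scala
import org.apache.spark.sql.SparkSession

object SchemaCheck {
  // Builds a tiny DataFrame mirroring the question's document (trimmed)
  // and returns its top-level field names, so the nesting is visible.
  def topLevelFields(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema-check")
      .getOrCreate()
    import spark.implicits._

    // One-line sample shaped like the question's JSON (hypothetical data).
    val json = Seq(
      """{"_id":"AVzO9dqvoaL5S78GvkQU","_index":"dp_api-2017.06.22","_score":1,"_type":"DPAPI","_source":{"request":{"user_ip":"ip","method":"POST"},"svc":"example"}}"""
    )
    val df = spark.read.json(json.toDS())

    // _source shows up as a single struct column in the printed tree,
    // which is why the table above collapses it into one cell.
    df.printSchema()

    val fields = df.schema.fieldNames.toSeq
    spark.stop()
    fields
  }
}
```

`printSchema()` on the real file should show `_source` as a `struct` with `request`, `svc`, `cipher_info`, etc. nested inside it.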

So what I think is that when it comes to nested values the process is "lazy" and doesn't create the columns correctly. My question is: is it possible to extract the content of _source before creating the DataFrame? Doing it after the DataFrame is created is also fine, but the output should be a table with as many columns as there are keys. For example:

+--------------------+-----------------+------+-----------------------+-----+
|                 _id|           _index|_score|_source.request.user_ip|_type|
+--------------------+-----------------+------+-----------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22|     1|ip                     |DPAPI|
+--------------------+-----------------+------+-----------------------+-----+

Hope I explained it well! Thanks.

Upvotes: 0

Views: 286

Answers (1)

Ramesh Maharjan

Reputation: 41957

I guess extracting after you create the dataframe should be more efficient and faster than extracting before you create it.

You have used the built-in functions and APIs for creating a dataframe from JSON. If you extract before that, you will have to write some custom parsing yourself, which won't be as efficient as the provided APIs.

You can just select the columns you require after the dataframe is created, as below:

example.select("_id", "_index", "_score", "_source.request.user_ip", "_type").show(false)

which should give you the dataframe you require:

+--------------------+-----------------+------+-------+-----+
|_id                 |_index           |_score|user_ip|_type|
+--------------------+-----------------+------+-------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22|1     |ip     |DPAPI|
+--------------------+-----------------+------+-------+-----+
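If you want every key inside `_source` as its own column rather than hand-picking them, `select` also accepts a star over a struct column (`_source.*`), which expands one level of nesting. A self-contained sketch (my assumption: a local SparkSession and a trimmed in-memory copy of the document instead of the real file):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FlattenSource {
  // Expands every field of the _source struct into its own top-level
  // column (one level deep) and returns the resulting column names.
  def flattenedColumns(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten")
      .getOrCreate()
    import spark.implicits._

    // One-line sample shaped like the question's JSON (hypothetical data).
    val json = Seq(
      """{"_id":"AVzO9dqvoaL5S78GvkQU","_index":"dp_api-2017.06.22","_score":1,"_type":"DPAPI","_source":{"request":{"user_ip":"ip","method":"POST"},"svc":"example"}}"""
    )
    val example = spark.read.json(json.toDS())

    // col("_source.*") expands the struct's fields into separate columns,
    // alongside the ordinary top-level columns we keep explicitly.
    val flattened = example.select(
      col("_id"), col("_index"), col("_score"), col("_type"),
      col("_source.*")
    )
    flattened.show(false)

    val cols = flattened.columns.toSeq
    spark.stop()
    cols
  }
}
```

Note that this flattens only one level: `request` is still a struct afterwards, so you would repeat the trick (`col("request.*")`) or keep using dotted paths for deeper fields.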

I hope I answered your query well.

Upvotes: 2
