Mukund S

Reputation: 95

Spark - Parse json file which contains additional text

My JSON file has many lines and every line looks like this.

Mon Jan 20 00:00:00 -0800 2014, {"cl":"js","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36","ip":"76.4.253.137","cc":"US","rg":"NV","ct":"North Las Vegas","pc":"89084","mc":839,"bf":"402d6c3bdd18e5b5f6541a98a01ecc47d698420d","vst":"0e1c96ff-1f4a-4279-bfdc-ba3fe51c2a4e","lt":"Sun Jan 19 23:59:59 -0800 2014","hk":["memba","alyson stoner","memba them","member them","member them 80s","missy elliotts","www.tmzmembathem","80s memba then","missy elliott","mini"]}, 

(space added here for clarity; the same line continues with a second JSON object:)

{"v":"1.1","pv":"7963ee21-0d09-4924-b315-ced4adad425f","r":"v3","t":"tmzdtcom","a":[{"i":15,"u":"ll-media.tmz.com/2012/10/03/100312-alyson-stoner-then-480w.jpg","w":523,"h":480,"x":503,"y":651,"lt":"none","af":false}],"rf":"http://www.zergnet.com/news/128786/stars-whove-changed-a-lot-since-you-last-saw-them","p":"www.tmz.com/photos/2007/12/20/740-memba-them/images/2012/10/03/100312-alyson-stoner-then-jpg/","fs":true,"tr":0.7,"ac":{},"vp":{"ii":false,"w":1915,"h":1102},"sc":{"w":1920,"h":1200,"d":1},"pid":239,"vid":1,"ss":"0.5"}

I tried the following:

Method 1 :

val value1 = sc.textFile(filename).map(_.substring(32))

val df = sqlContext.read.json(value1)

Here I am trying to omit the text at the start of each line. In this case, I get only the first JSON object from each line.

That is:

{"cl":"js","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36","ip":"76.4.253.137","cc":"US","rg":"NV","ct":"North Las Vegas","pc":"89084","mc":839,"bf":"402d6c3bdd18e5b5f6541a98a01ecc47d698420d","vst":"0e1c96ff-1f4a-4279-bfdc-ba3fe51c2a4e","lt":"Sun Jan 19 23:59:59 -0800 2014","hk":["memba","alyson stoner","memba them","member them","member them 80s","missy elliotts","www.tmzmembathem","80s memba then","missy elliott","mini"]}

Method 2 :

val df = sqlContext.read.json(sc.wholeTextFiles(filename).values) 

In this case, all I get in the output is a corrupt record (`_corrupt_record`).

Could you please tell me what is the problem here and how to parse this kind of file?

Upvotes: 0

Views: 888

Answers (1)

Vidya

Reputation: 30300

Your sqlContext.read.json only works when each line of the file is a complete JSON entry; it does not handle JSON that has been expanded or "pretty-printed" across multiple lines. Your best bet is to do this:

val jsonRDD = sparkContext.wholeTextFiles(fileName).map(_._2)

As stated in the documentation, wholeTextFiles returns RDD[(String, String)] where the first entry in each Tuple2 is the file name and the second is the contents. Only the second is what you care about, so you access the contents with ._2.

Then you can convert the RDD to a DataFrame and parse the contents with from_json. (Note that to_json goes the other direction, serializing a struct column into a JSON string; parsing a string column requires from_json together with a schema.)

val jsonDF = sparkContext
   .wholeTextFiles(fileName)
   .map(_._2)
   .toDF("json")
   .select(from_json('json, schema)) // schema: a StructType describing the JSON
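Since each line in the question actually carries a timestamp prefix followed by two JSON objects, another option is to split the objects out of every line first and then read them. A minimal sketch in plain Scala (the helper name extractJsonObjects is hypothetical, and it assumes the non-JSON prefix contains no braces and that objects are delimited by balanced braces outside quoted strings):

```scala
// Extracts every top-level {...} object from a line, ignoring braces
// that appear inside quoted JSON strings. Hypothetical helper, not
// part of Spark.
def extractJsonObjects(line: String): Seq[String] = {
  val out = scala.collection.mutable.ListBuffer.empty[String]
  var depth = 0      // current {...} nesting depth
  var start = -1     // index where the current top-level object began
  var inString = false
  var escaped = false
  for (i <- line.indices) {
    val c = line.charAt(i)
    if (escaped) escaped = false
    else if (inString) {
      if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else c match {
      case '"' if depth > 0 => inString = true
      case '{' =>
        if (depth == 0) start = i
        depth += 1
      case '}' =>
        depth -= 1
        if (depth == 0) out += line.substring(start, i + 1)
      case _ =>
    }
  }
  out.toList
}

// Usage with Spark (sketch): one DataFrame row per JSON object.
// val objects = sc.textFile(filename).flatMap(extractJsonObjects)
// val df = sqlContext.read.json(objects)
```

Unlike substring(32), this does not depend on the prefix having a fixed width, and it picks up both JSON objects from each line rather than only the first.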

Upvotes: 1
