Reputation: 874
My data structure is in JSON format:
"header"{"studentId":"1234","time":"2016-06-23","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-24","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-25","homeworkSubmitted":True}
"header"{"studentId":"1236","time":"2016-06-23","homeworkSubmitted":False}
"header"{"studentId":"1236","time":"2016-06-24","homeworkSubmitted":True}
....
I need to plot a histogram that shows number of homeworkSubmitted: True over all stidentIds. I wrote code that flattens the data structure, so my keys are header.studentId, header.time and header.homeworkSubmitted.
I used keyBy to group by studentId:
initialRDD.keyBy(lambda row: row['header.studentId'])
.map(lambda (k,v): (k,v['header.homeworkSubmitted']))
.map(mapTF).groupByKey().mapValues(lambda x: Counter(x)).collect()
This gives me result like this:
("1234", Counter({0:0, 1:3}),
("1236", Counter(0:1, 1:1))
I need only number of counts of 1, possibly mapped to a list so that I can plot a histogram using matplotlib. I am not sure how to proceed and filter everything.
Edit: at the end I iterated through the dictionary and added counts to a list and then plotted histogram of the list. I am wondering if there is a more elegant way to do the whole process I outlined in my code.
Upvotes: 12
Views: 72650
Reputation: 800
You can filter out the false, keeping it in RDD, then count the True with counter
initialRDD.filter(lambda row : row['header.homeworkSubmitted'])
Another solution would be to sum the booleans
data = sc.parallelize([('id1',True),('id1',True),
('id2',False),
('id2',False),('id3',False),('id3',True) ])
data.reduceByKey(lambda x,y:x+y).collect()
Outputs
[('id2', 0), ('id3', 1), ('id1', 2)]
Upvotes: 0
Reputation: 2794
df = sqlContext.read.json('/path/to/your/dataset/')
df.filter(df.homeworkSubmitted == True).groupby(df.studentId).count()
Note it is not valid JSON if there is a "header"
or True
instead of true
Upvotes: 26
Reputation: 2228
I don't have Spark in front of me right now, though I can edit this tomorrow when I do.
But if I'm understanding this you have three key-value RDDs, and need to filter by homeworkSubmitted=True. I would think you turn this into a dataframe, then use:
df.where(df.homeworkSubmitted==True).count()
You could then use group by operations if you wanted to explore subsets based on the other columns.
Upvotes: 1