Anastasia

Reputation: 874

Pyspark: groupby and then count true values

My data structure is in JSON format:

"header"{"studentId":"1234","time":"2016-06-23","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-24","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-25","homeworkSubmitted":True}
"header"{"studentId":"1236","time":"2016-06-23","homeworkSubmitted":False}
"header"{"studentId":"1236","time":"2016-06-24","homeworkSubmitted":True}
....

I need to plot a histogram that shows the number of homeworkSubmitted: True over all studentIds. I wrote code that flattens the data structure, so my keys are header.studentId, header.time and header.homeworkSubmitted.

I used keyBy to group by studentId:

    result = (initialRDD
              .keyBy(lambda row: row['header.studentId'])
              .map(lambda kv: (kv[0], kv[1]['header.homeworkSubmitted']))
              .map(mapTF)
              .groupByKey()
              .mapValues(lambda x: Counter(x))
              .collect())

This gives me result like this:

[("1234", Counter({0: 0, 1: 3})),
 ("1236", Counter({0: 1, 1: 1}))]

I need only number of counts of 1, possibly mapped to a list so that I can plot a histogram using matplotlib. I am not sure how to proceed and filter everything.

Edit: at the end I iterated through the dictionary and added counts to a list and then plotted histogram of the list. I am wondering if there is a more elegant way to do the whole process I outlined in my code.
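That post-processing step can be done in a single comprehension over the collected result. A minimal sketch, assuming the result has the shape shown above (the data here is illustrative):

```python
from collections import Counter

# Collected result in the shape shown above (illustrative data)
grouped = [("1234", Counter({0: 0, 1: 3})),
           ("1236", Counter({0: 1, 1: 1}))]

# Keep only the count of True (key 1) for each student
true_counts = [counts[1] for _, counts in grouped]
# true_counts == [3, 1]
```

Passing true_counts to matplotlib's plt.hist() then plots the histogram directly, with no explicit iteration over the dictionary.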

Upvotes: 12

Views: 72650

Answers (3)

Alexis Benichoux

Reputation: 800

You can filter out the False values, keeping the data in an RDD, and then count the remaining True rows:

initialRDD.filter(lambda row : row['header.homeworkSubmitted'])

Another solution would be to sum the booleans

data = sc.parallelize([('id1', True), ('id1', True),
                       ('id2', False), ('id2', False),
                       ('id3', False), ('id3', True)])

data.reduceByKey(lambda x, y: x + y).collect()

Outputs

[('id2', 0), ('id3', 1), ('id1', 2)]
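This works because Python booleans are integers (True == 1, False == 0), so adding them per key counts the Trues. The same trick, sketched without Spark so it can be run anywhere:

```python
from itertools import groupby

data = [('id1', True), ('id1', True),
        ('id2', False), ('id2', False),
        ('id3', False), ('id3', True)]

# True == 1 and False == 0, so summing the booleans counts the Trues
counts = {k: sum(v for _, v in grp)
          for k, grp in groupby(sorted(data), key=lambda kv: kv[0])}
# counts == {'id1': 2, 'id2': 0, 'id3': 1}
```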

Upvotes: 0

shuaiyuancn

Reputation: 2794

df = sqlContext.read.json('/path/to/your/dataset/')
df.filter(df.homeworkSubmitted == True).groupby(df.studentId).count()

Note that the input shown in the question is not valid JSON: the "header" key is missing its colon, and True/False must be lowercase true/false.
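To illustrate, a corrected record would look like the line below: one JSON object per line (the JSON Lines shape that sqlContext.read.json expects), with the field names taken from the question. Standard json.loads can then parse it:

```python
import json

# One corrected JSON Lines record: colon after "header", lowercase true
line = '{"header": {"studentId": "1234", "time": "2016-06-23", "homeworkSubmitted": true}}'
rec = json.loads(line)
# rec["header"]["homeworkSubmitted"] is the Python boolean True
```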

Upvotes: 26

Jeff

Reputation: 2228

I don't have Spark in front of me right now, though I can edit this tomorrow when I do.

But if I'm understanding this, you have a key-value RDD with three fields and need to filter by homeworkSubmitted=True. I would think you turn this into a dataframe, then use:

df.where(df.homeworkSubmitted==True).count()

You could then use group by operations if you wanted to explore subsets based on the other columns.

Upvotes: 1
