Reputation: 83
Does anyone know how Spark computes its number of records (I think it is the same as the number of events in a batch), as displayed here?
I'm trying to figure out how I can get this value remotely (there is no REST API for the Streaming tab in the UI).
Basically, what I'm trying to do is get the total number of records processed by my application. I need this information for the web portal.
I tried to count the records for each stage, but it gave me a completely different number than the one in the picture above. Each stage contains information about its records, as shown here.
I'm using this short Python script to count the "inputRecords" from each stage. This is the source code:
import json, urllib

print "Get stages script started!"

# REST API endpoint listing all stages of the application
url = 'http://10.16.31.211:4040/api/v1/applications/app-20161104125052-0052/stages/'
response = urllib.urlopen(url)
data = json.loads(response.read())

print len(data)

# Sum the input records over all stages
stages = []
inputCounter = 0
for item in data:
    stages.append(item["stageId"])
    inputCounter += item["inputRecords"]

print "Records processed: " + str(inputCounter)
If I understood it correctly: each Batch has one Job, each Job has multiple Stages, and these Stages have multiple Tasks. So it made sense to me to count the input for each Stage.
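To sanity-check that hierarchy, the job-to-stage mapping can also be listed through the same REST API (the /jobs endpoint and its jobId/stageIds fields are part of Spark's v1 API; host and application ID are the same as in my script above):

import json, urllib

# List the jobs of the application and the stages each job consists of
url = 'http://10.16.31.211:4040/api/v1/applications/app-20161104125052-0052/jobs/'
jobs = json.loads(urllib.urlopen(url).read())

for job in jobs:
    # each job reports the IDs of the stages it ran
    print "Job %d -> stages %s" % (job["jobId"], job["stageIds"])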
Upvotes: 2
Views: 4872
Reputation: 37435
Spark offers a metrics endpoint on the driver:
<driver-host>:<ui-port>/metrics/json
A Spark Streaming application reports all the metrics shown in the UI, and some more. The ones you are most likely looking for are:
<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalProcessedRecords: {
value: 48574640
},
<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalReceivedRecords: {
value: 48574640
}
This endpoint can be customized; see the Spark Metrics documentation for more info.
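As a sketch of how to consume this remotely (assuming the default MetricsServlet sink, which serves Dropwizard-style JSON where each gauge is keyed by the full metric name; host and port below are placeholders taken from your script):

import json, urllib

# Poll the driver's metrics servlet; adjust host/port to your deployment
url = 'http://10.16.31.211:4040/metrics/json'
metrics = json.loads(urllib.urlopen(url).read())

# Streaming totals are exposed as gauges under the full metric name
for name, gauge in metrics["gauges"].items():
    if name.endswith("StreamingMetrics.streaming.totalProcessedRecords"):
        print "Total processed records: %d" % gauge["value"]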
Upvotes: 5