Reputation: 7409
Livy has a batch log endpoint, GET /batches/{batchId}/log, pointed out in "How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow".
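For reference, calling that endpoint directly with Python requests looks roughly like this (the Livy host, batch id, and the from/size paging values below are placeholders for my setup):

import requests

LIVY_URL = "http://livy-host:8998"   # placeholder Livy host/port
batch_id = 19                        # placeholder batch id

# GET /batches/{batchId}/log returns a window of the batch's log lines;
# 'from' and 'size' are the paging parameters from the REST API docs.
resp = requests.get(f"{LIVY_URL}/batches/{batch_id}/log",
                    params={"from": 0, "size": 100})
resp.raise_for_status()
for line in resp.json().get("log", []):
    print(line)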
As far as I can tell, these logs are the Livy logs and not the Spark driver logs. I have a print statement in a PySpark job which prints to the driver's stdout log.
I am able to find the driver log URL via the describe-batch endpoint (https://livy.incubator.apache.org/docs/latest/rest-api.html#batch): by visiting the response['appInfo']['driverLogUrl'] URL from the JSON response and clicking through to the logs.
The URL in the JSON response looks like http://ip-some-ip.emr.masternode:8042/node/containerlogs/container_1578061839438_0019_01_000001/livy/, and I can click through to an HTML page at the added URL leaf stdout/?start=-4096 to see the logs.
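For context, this is roughly how I pull that driverLogUrl out of the describe-batch response (host and batch id are placeholders; appInfo can be empty until YARN has reported the application):

import requests

LIVY_URL = "http://livy-host:8998"   # placeholder Livy host/port
batch_id = 19                        # placeholder batch id

# GET /batches/{batchId} describes the batch; appInfo holds driverLogUrl
# once YARN has reported it (it can be missing early in the batch's life).
batch = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()
driver_log_url = (batch.get("appInfo") or {}).get("driverLogUrl")
print(driver_log_url)  # e.g. the NodeManager containerlogs page shown above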
As it is, I can only get an HTML page of the stdout. Does a JSON-API-like version of this stdout (and preferably stderr too) exist in the YARN/EMR/Hadoop resource manager? Otherwise, is Livy able to retrieve these driver logs somehow?
Or, is this an issue because I am using cluster mode instead of client mode? When I try to use client mode, I've been unable to use python3 and PYSPARK_PYTHON, which is maybe for a different question, but if I'm able to get the stdout of the driver using a different deployMode, then that would work too.
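For reference, this is roughly how I'd expect to pass a Python setting through Livy's POST /batches conf map (untested; the spark.yarn.appMasterEnv.PYSPARK_PYTHON / spark.executorEnv.PYSPARK_PYTHON keys and every path below are placeholders I haven't verified):

import json
import requests

LIVY_URL = "http://livy-host:8998"   # placeholder Livy host/port

payload = {
    "file": "s3://my-bucket/my_job.py",   # placeholder PySpark script
    "conf": {
        # Assumption: point the YARN application master (the driver in
        # cluster mode) and the executors at python3; paths are placeholders.
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/bin/python3",
        "spark.executorEnv.PYSPARK_PYTHON": "/usr/bin/python3",
    },
}
resp = requests.post(f"{LIVY_URL}/batches",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.json())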
If it matters, I'm running the cluster on EMR.
Upvotes: 3
Views: 2277
Reputation: 121
I met the same problem. The short answer is that it will only work for client mode, not for cluster mode.
This is because we try to get all the logs from the master node, but the print output is local to the driver node.
When Spark is running in client mode, the driver runs on your master node, so we get both the log info and the print output, since they are on the same physical machine.
However, things are different when Spark is running in cluster mode. In this case, the driver runs on one of your worker nodes, not your master node. Therefore we lose the print output, since Livy only gets info from the master node.
Upvotes: 1
Reputation: 83
You can fetch all the logs, including stdout, stderr and YARN diagnostics, with GET /batches/{batchId} (as you can see at the batch endpoint of the REST API).
Here is a code example:
import requests

# self.job is the URL of the batch object returned by `POST /batches`;
# self.headers holds the request headers (e.g. Content-Type)
job_response = requests.get(self.job, headers=self.headers).json()
self.job_status = job_response['state']
print(f"Job status: {self.job_status}")
for log in job_response['log']:
    print(log)
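If you want the log after the job finishes, you can poll the same endpoint until the batch reaches a terminal state and then print the log field (a sketch; the URL, headers and the set of terminal states below are assumptions you should adjust for your Livy version):

import time
import requests

batch_url = "http://livy-host:8998/batches/19"   # placeholder for self.job above
headers = {"Content-Type": "application/json"}   # placeholder for self.headers above

# Poll until the batch reaches an assumed terminal state,
# then print whatever log lines the response carries.
while True:
    job_response = requests.get(batch_url, headers=headers).json()
    if job_response['state'] in ('success', 'dead', 'killed', 'error'):
        break
    time.sleep(10)

for log in job_response['log']:
    print(log)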
The printed logs look like this (note that these are Spark job logs, not Livy logs):
20/01/10 05:28:57 INFO Client: Application report for application_1578623516978_0024 (state: ACCEPTED)
20/01/10 05:28:58 INFO Client: Application report for application_1578623516978_0024 (state: ACCEPTED)
20/01/10 05:28:59 INFO Client: Application report for application_1578623516978_0024 (state: RUNNING)
20/01/10 05:28:59 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.2.100.6
ApplicationMaster RPC port: -1
queue: default
start time: 1578634135032
final status: UNDEFINED
tracking URL: http://ip-10-2-100-176.ap-northeast-2.compute.internal:20888/proxy/application_1578623516978_0024/
user: livy
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: Application application_1578623516978_0024 has started running.
20/01/10 05:28:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38087.
20/01/10 05:28:59 INFO NettyBlockTransferService: Server created on ip-10-2-100-176.ap-northeast-2.compute.internal:38087
20/01/10 05:28:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/01/10 05:28:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-2-100-176.ap-northeast-2.compute.internal:38087 with 5.4 GB RAM, BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManager: external shuffle service port = 7337
20/01/10 05:28:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-2-100-176.ap-northeast-2.compute.internal, PROXY_URI_BASES -> http://ip-10-2-100-176.ap-northeast-2.compute.internal:20888/proxy/application_1578623516978_0024), /proxy/application_1578623516978_0024
20/01/10 05:28:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/01/10 05:28:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/01/10 05:28:59 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/01/10 05:28:59 INFO EventLoggingListener: Logging events to hdfs:/var/log/spark/apps/application_1578623516978_0024
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/01/10 05:28:59 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
...
Please check the Livy REST API docs for further information.
Upvotes: 0