Reputation: 133
I have a PySpark job running on Google Dataproc.
The goal is to get the application-level logs for this job into Stackdriver Logging and to create metrics from them.
How can I achieve this? I've already changed Spark's log4j properties to write to /var/log/spark/spark-out.log, but that file doesn't seem to contain the expected data.
Upvotes: 1
Views: 459
Reputation: 41
If you log to /var/log/spark/spark-out.log, it will make its way into Stackdriver Logging.

Logging from a worker to sys.stderr or sys.stdout will likewise be collected by Stackdriver Logging as yarn-userlogs.
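For example, a worker-side sketch along these lines would end up in yarn-userlogs; the app name and the processing function are just illustrative, not something the behavior depends on:

import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stderr-logging-example").getOrCreate()

def tag_elements(elements):
    # This function runs inside a task on a worker, so anything written to
    # sys.stderr / sys.stdout lands in the YARN container logs, which the
    # Dataproc logging agent collects as yarn-userlogs.
    for element in elements:
        print("processing element: %s" % element, file=sys.stderr)
        yield element

spark.sparkContext.parallelize(range(10)).mapPartitions(tag_elements).collect()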
Output from the driver itself is not collected into Stackdriver Logging by default. That said, one could configure a file-based logger in the driver to log to /var/log/spark/<filename>.log, and Stackdriver would pick up that file as well (a sketch of such a logger follows the config below). From the Dataproc fluentd config file:
# Fluentd config to tail the hadoop, hive, and spark message log.
# Currently severity is a separate field from the Cloud Logging log_level.
<source>
type tail
format multi_format
<pattern>
format /^((?<time>[^ ]* [^ ]*) *(?<severity>[^ ]*) *(?<class>[^ ]*): (?<message>.*))/
time_format %Y-%m-%d %H:%M:%S,%L
</pattern>
<pattern>
format none
</pattern>
path /var/log/hadoop*/*.log,/var/log/hive/*.log,/var/log/spark/*.log,
pos_file /var/tmp/fluentd.dataproc.hadoop.pos
refresh_interval 2s
read_from_head true
tag raw.tail.*
</source>
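As a rough sketch (not an official recipe), a driver-side file logger like the following should produce lines the tail source above picks up, assuming the driver process can write under /var/log/spark/; the logger name and file name are placeholders:

import logging

# Plain Python file logger in the driver. The path only needs to match the
# /var/log/spark/*.log glob tailed by the fluentd source above; the
# "my-pyspark-job" logger name and file name are arbitrary examples.
logger = logging.getLogger("my-pyspark-job")
logger.setLevel(logging.INFO)

handler = logging.FileHandler("/var/log/spark/my-pyspark-job.log")
# The default asctime ("2020-01-01 12:00:00,123") lines up with the
# "time severity class: message" regex in the fluentd config, so severity
# and class are parsed out instead of falling back to the "format none" pattern.
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

logger.info("driver-side application log line")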
Upvotes: 2