knc

Reputation: 133

Stackdriver logging for PySpark

I have a PySpark job running on Google Dataproc.

The goal is to get the application-level logs for this job into Stackdriver Logging and to create metrics from them.

How can I achieve this? I've already changed Spark's log4j properties to write to /var/log/spark/spark-out.log, but that file doesn't seem to contain the logs I expect.

Upvotes: 1

Views: 459

Answers (1)

John Mikula

Reputation: 41

If you log to /var/log/spark/spark-out.log, it will make its way into Stackdriver Logging.

Logging from a worker to sys.stderr or sys.stdout will likewise be collected by Stackdriver Logging as yarn-userlogs.
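
For example, here is a minimal sketch of worker-side logging to stderr (the app name, RDD, and function are illustrative, not from the original post):

import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stderr-logging-demo").getOrCreate()

def process_partition(rows):
    # stderr on a worker goes to the YARN container logs,
    # which Stackdriver Logging collects as yarn-userlogs.
    print("processing a partition", file=sys.stderr)
    for row in rows:
        yield row

spark.sparkContext.parallelize(range(100)).mapPartitions(process_partition).count()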

Output from the driver itself is not collected into Stackdriver Logging by default. That said, you could configure a file-based logger in the driver to log to /var/log/spark/<filename>.log, and Stackdriver would pick up that file as well (a sketch follows the snippet below). From the cluster's fluentd config file:

# Fluentd config to tail the hadoop, hive, and spark message log.
# Currently severity is a separate field from the Cloud Logging log_level.
<source>
    type tail
    format multi_format
    <pattern>
        format /^((?<time>[^ ]* [^ ]*) *(?<severity>[^ ]*) *(?<class>[^ ]*): (?<message>.*))/
        time_format %Y-%m-%d %H:%M:%S,%L
    </pattern>
    <pattern>
        format none
    </pattern>
    path /var/log/hadoop*/*.log,/var/log/hive/*.log,/var/log/spark/*.log
    pos_file /var/tmp/fluentd.dataproc.hadoop.pos
    refresh_interval 2s
    read_from_head true
    tag raw.tail.*
</source>
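
Here is a minimal sketch of such a driver-side logger using Python's standard logging module. The filename pyspark-driver.log and the logger name are hypothetical; the format string is an assumption chosen to match the multiline pattern and time_format in the config above:

import logging

# Any /var/log/spark/*.log file is tailed by the fluentd config above;
# the filename here is hypothetical.
handler = logging.FileHandler("/var/log/spark/pyspark-driver.log")

# Produces lines like "2017-01-01 12:00:00,123 INFO my_job: message",
# matching the "<time> <severity> <class>: <message>" pattern above.
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("my_job")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Driver started")  # ends up in Stackdriver Logging via the tailed file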

Upvotes: 2
