Reputation: 153
I'm trying to build an Apache Spark application that normalizes CSV files from HDFS (changes the delimiter, fixes broken lines). I use log4j for logging, but all the logs are only written on the executors, so the only way I can check them is with the yarn logs -applicationId command. Is there any way to redirect all logs (from the driver and from the executors) to my gateway node (the one that launches the Spark job) so I can check them during execution?
Upvotes: 0
Views: 2371
Reputation: 2468
There is an indirect way to achieve this. Enable the following property in yarn-site.xml:
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
This stores the logs of all submitted applications in an HDFS location. You can then download them into a single aggregated file with the following command:
yarn logs -applicationId application_id_example > app_logs.txt
I also came across this GitHub repo, which downloads the driver and container logs separately. Clone the repository: https://github.com/hammerlab/yarn-logs-helpers
git clone --recursive https://github.com/hammerlab/yarn-logs-helpers.git
In your .bashrc (or equivalent), source .yarn-logs-helpers.sourceme:
$ source /path/to/repo/.yarn-logs-helpers.sourceme
Then download the aggregated logs, split into separate driver and container logs, with this command:
yarn-container-logs application_example_id
Upvotes: 0
Reputation: 191701
You should have the executors' log4j properties configured so that they write log files locally on their own nodes. Streaming the logs back to the driver would add unnecessary latency to processing.
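A minimal sketch of that setup, assuming Spark on YARN with log4j 1.x (the file name log4j-executor.properties, the appender name, the log file name, and the application class/jar are made-up placeholders, not something from your application):

# Hypothetical log4j config that makes each executor write to a file in its
# own YARN container log directory (names are placeholders, adjust to taste).
cat > log4j-executor.properties <<'EOF'
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=${spark.yarn.app.container.log.dir}/normalizer.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF

# Ship the file with the application and point the executor (and driver) JVMs at it.
# com.example.Normalizer and my-app.jar are placeholders for your own application.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files log4j-executor.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
  --class com.example.Normalizer my-app.jar

Because the properties file is shipped with --files, referring to it by bare file name in -Dlog4j.configuration generally works, since it lands in each container's working directory.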
If you want to be able to "tail" the logs in near real time, you need to set up a solution like Splunk or Elasticsearch, and use tools such as Splunk Forwarders, Fluentd, or Filebeat: agents on each box that watch the configured log paths and push the data to a destination indexer, which parses and extracts the log fields.
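As a rough illustration of that agent approach, a minimal Filebeat setup could watch the NodeManager's container log directory and ship everything to Elasticsearch. The log path and host below are assumptions; check yarn.nodemanager.log-dirs and your Elasticsearch address for your cluster.

# Illustrative Filebeat config; paths and hosts are assumptions.
cat > filebeat.yml <<'EOF'
filebeat.inputs:
  - type: log
    paths:
      # Assumed NodeManager container log location; verify yarn.nodemanager.log-dirs.
      - /var/log/hadoop-yarn/containers/*/*/*.log
output.elasticsearch:
  hosts: ["es-host.example.com:9200"]
EOF

# Run the agent on every worker node so the logs are indexed centrally.
filebeat -e -c filebeat.yml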
There are also alternatives such as StreamSets, NiFi, or KNIME (all open source), which offer more instrumentation for capturing event-processing failures and effectively provide "dead letter queues" for handling errors in a specific way. The part I like about those tools: no programming required.
Upvotes: 1
Reputation: 714
As per https://spark.apache.org/docs/preview/running-on-yarn.html#configuration,
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config in yarn-site.xml), container logs are copied to HDFS and deleted on the local machine. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml).
I am not sure whether the log aggregation from the worker nodes happens in real time, though.
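Assuming aggregation is enabled, something along these lines should locate and list the aggregated logs in HDFS once the application finishes. The path layout and application id are typical placeholders and may differ between Hadoop versions.

# Ask YARN where it aggregates logs (defaults are often /tmp/logs and a "logs" suffix).
hdfs getconf -confKey yarn.nodemanager.remote-app-log-dir
hdfs getconf -confKey yarn.nodemanager.remote-app-log-dir-suffix

# List the aggregated container logs for one application;
# the <log-dir>/<user>/<suffix>/<application-id> layout is common but may vary.
hdfs dfs -ls /tmp/logs/$USER/logs/application_id_example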
Upvotes: 0
Reputation: 71
I think it is not possible. When you run Spark in local mode you can see the logs in the console. Otherwise you have to alter the log4j properties to change the log file path.
Upvotes: -1