Reputation: 153
I'm trying to build an Apache Spark application that normalizes CSV files from HDFS (changes the delimiter, fixes broken lines). I use log4j for logging, but all the logs are only written on the executors, so the only way I can check them is with the yarn logs -applicationId command. Is there any way to redirect all logs (from the driver and from the executors) to my gateway node (the one that launches the Spark job) so I can check them during execution?
Upvotes: 0
Views: 2371
Reputation: 2468
There is an indirect way to achieve this. Enable the following property in yarn-site.xml:
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
This stores the logs of all submitted applications in an HDFS location. You can then download them into a single aggregated file with the following command:
yarn logs -applicationId application_id_example > app_logs.txt
I also came across this GitHub repo, which downloads the driver and container logs separately. Clone the repository: https://github.com/hammerlab/yarn-logs-helpers
git clone --recursive https://github.com/hammerlab/yarn-logs-helpers.git
In your .bashrc (or equivalent), source .yarn-logs-helpers.sourceme:
$ source /path/to/repo/.yarn-logs-helpers.sourceme
Then download the aggregated logs, split into separate driver and container logs, with this command:
yarn-container-logs application_example_id
Upvotes: 0
Reputation: 191701
You should have the executors' log4j properties configured so that they write log files locally on their own nodes. Streaming the logs back to the driver would add unnecessary latency to processing.
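A minimal sketch of that setup, assuming Spark on YARN with log4j 1.x (the file name log4j-executor.properties, the appender name, the log file name, and the application class/jar are made-up placeholders, not something from your application):

# Hypothetical log4j config that makes each executor write to a file in its
# own YARN container log directory (names are placeholders, adjust to taste).
cat > log4j-executor.properties <<'EOF'
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=${spark.yarn.app.container.log.dir}/normalizer.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF

# Ship the file with the application and point the executor (and driver) JVMs at it.
# com.example.Normalizer and my-app.jar are placeholders for your own application.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files log4j-executor.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
  --class com.example.Normalizer my-app.jar

Because the properties file is shipped with --files, referring to it by bare file name in -Dlog4j.configuration generally works, since it lands in each container's working directory.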
If you want to be able to "tail" the logs in near real time, you need to set up a solution like Splunk or Elasticsearch, and use tools such as Splunk Forwarders, Fluentd, or Filebeat: agents on each box that watch the configured log paths and push the data to a destination indexer, which parses and extracts the log fields.
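As a rough illustration of that agent approach, a minimal Filebeat setup could watch the NodeManager's container log directory and ship everything to Elasticsearch. The log path and host below are assumptions; check yarn.nodemanager.log-dirs and your Elasticsearch address for your cluster.

# Illustrative Filebeat config; paths and hosts are assumptions.
cat > filebeat.yml <<'EOF'
filebeat.inputs:
  - type: log
    paths:
      # Assumed NodeManager container log location; verify yarn.nodemanager.log-dirs.
      - /var/log/hadoop-yarn/containers/*/*/*.log
output.elasticsearch:
  hosts: ["es-host.example.com:9200"]
EOF

# Run the agent on every worker node so the logs are indexed centrally.
filebeat -e -c filebeat.yml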
There are also alternatives such as StreamSets, NiFi, or KNIME (all open source), which offer more instrumentation for capturing event-processing failures and effectively provide "dead letter queues" for handling errors in a specific way. The part I like about those tools: no programming required.
Upvotes: 1
Reputation: 714
As per https://spark.apache.org/docs/preview/running-on-yarn.html#configuration,
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config in yarn-site.xml), container logs are copied to HDFS and deleted on the local machine. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml).
I am not sure whether the log aggregation from the worker nodes happens in real time, though.
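Assuming aggregation is enabled, something along these lines should locate and list the aggregated logs in HDFS once the application finishes. The path layout and application id are typical placeholders and may differ between Hadoop versions.

# Ask YARN where it aggregates logs (defaults are often /tmp/logs and a "logs" suffix).
hdfs getconf -confKey yarn.nodemanager.remote-app-log-dir
hdfs getconf -confKey yarn.nodemanager.remote-app-log-dir-suffix

# List the aggregated container logs for one application;
# the <log-dir>/<user>/<suffix>/<application-id> layout is common but may vary.
hdfs dfs -ls /tmp/logs/$USER/logs/application_id_example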
Upvotes: 0
Reputation: 71
I think it is not possible. When you run Spark in local mode you can see the logs in the console. Otherwise you have to alter the log4j properties to change the log file path.
Upvotes: -1