B. Smith

Reputation: 1193

Where does EMR store Spark stdout?

I am running my Spark application on EMR, and have several println() statements. Other than the console, where do these statements get logged?

My S3 aws-logs directory structure for my cluster looks like:

node
├── i-0031cd7a536a42g1e
│   ├── applications
│   ├── bootstrap-actions
│   ├── daemons
│   ├── provision-node
│   └── setup-devices
containers/
├── application_12341331455631_0001
│   ├── container_12341331455631_0001_01_000001

Upvotes: 10

Views: 10389

Answers (2)

ayplam

Reputation: 1963

You can find your println output in a few places:

  • Resource Manager -> Your Application -> Logs -> stdout
  • Your S3 log directory -> containers/application_.../container_.../stdout (though this takes a few minutes to populate after the application finishes)
  • SSH into the EMR master node and run yarn logs -applicationId <Application ID> -log_files <log_file_type> (see the example below)
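
For example, to pull just the stdout from every container, using the application ID from the S3 listing in the question (substitute your own):

yarn logs -applicationId application_12341331455631_0001 -log_files stdout

Note that yarn logs only finds anything once YARN log aggregation has run, and the -log_files filter is only available on newer Hadoop versions; without it you get every log file for each container.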

Upvotes: 14

xmorera

Reputation: 1961

There is one very important thing to consider when printing from Spark: is your code executed on the driver, or does it run on an executor?

For example, if you do the following, the output appears in the console because collect() brings the data back to the driver:

for i in your_rdd.collect():
    print(i)

But the following runs inside an executor, so its output is written to the executor logs rather than the console:

def run_in_executor(value):
    print(value)

# foreach() is an action, so the function actually executes on the executors;
# a lazy transformation like map() on its own would never print anything
your_rdd.foreach(run_in_executor)

Coming back to your original question: the second case is what writes to the log location. Logs usually end up on the master node, under /mnt/var/log/hadoop/steps, but it might be better to configure logging to an S3 bucket with --log-uri. That way the logs are much easier to find.
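
Note that --log-uri can only be set when the cluster is launched. A minimal sketch with the AWS CLI (the bucket name, release label, and instance settings here are placeholders to adapt):

aws emr create-cluster \
    --name "spark-cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --log-uri s3://your-log-bucket/logs/

With that set, the per-container stdout files show up under the containers/ prefix of the bucket, as in the S3 listing from the question.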

Upvotes: 3