hawarden_

Reputation: 2160

Spark: show dataframe content in logging (Java)

How can I show dataframe content (rows) in the logs in Java? I tried log.info(df.showString()), but it prints unreadable characters. I could use df.collectAsList(), but I still need to apply a filter afterwards, so that doesn't work for me.

Thank you.

Upvotes: 0

Views: 3490

Answers (1)

werner

Reputation: 14845

There are several options to log the data:

Collecting the data to the driver

You can call collectAsList() and continue processing afterwards. Spark datasets are immutable, so collecting them to the driver triggers execution, but you can still reuse the dataset afterwards for further processing steps:

Dataset<Data> ds = ... //1
List<Data> collectedDs = ds.collectAsList(); //2
doSomeLogging(collectedDs);
ds = ds.filter(<filter condition>); //3
ds.show();

The code above will collect the data in line //2, log it and then continue processing in line //3.
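doSomeLogging is only a placeholder in the snippet above. A minimal sketch of such a helper, assuming an SLF4J logger and elements with a readable toString() (like the hypothetical Data bean used here), could look like this:

import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingHelper {

  private static final Logger log = LoggerFactory.getLogger(LoggingHelper.class);

  // Runs on the driver: logs the number of collected rows and each row itself.
  public static <T> void doSomeLogging(List<T> collectedDs) {
    log.info("collected {} rows", collectedDs.size());
    for (T row : collectedDs) {
      log.info("row: {}", row);
    }
  }
}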

Depending on how complex the creation of the dataset in line //1 is, you might want to cache the dataset, so that the processing in line //1 runs only once:

Dataset<Data> ds = ... //1
ds = ds.cache();
List<Data> collectedDs = ds.collectAsList(); //2
....

Using map

Calling collectAsList() will send all your data to the driver. Usually you use Spark to distribute the data over several executor nodes, so the driver might not have enough memory to hold all of the data at once. In this case, you can log the data in a map call:

Dataset<Data> ds = ... //1
// the cast picks the Java MapFunction overload of map and avoids an ambiguous lambda
ds = ds.map((MapFunction<Data, Data>) d -> {
  System.out.println(d); //2
  return d; //3
}, Encoders.bean(Data.class));
ds = ds.filter(<filter condition>);
ds.show();

In this example, line //2 does the logging and line //3 simply returns the original object, so that the dataset remains unchanged. I assume that the Data class comes with a readable toString() implementation; otherwise, line //2 needs some more logic. It might also be helpful to use a logging library (like Log4j) instead of writing directly to standard out.
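As a rough sketch of that suggestion, assuming SLF4J (backed by Log4j or similar) is on the executor classpath and using the same hypothetical Data bean, the map call could use a logger instead of System.out.println:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

ds = ds.map((MapFunction<Data, Data>) d -> {
  // Looking the logger up inside the lambda avoids serializing it with the closure;
  // this code runs on the executors, not on the driver.
  Logger log = LoggerFactory.getLogger("DataLogger");
  log.info("row: {}", d);
  return d;
}, Encoders.bean(Data.class));

A static logger field on the enclosing class works as well, since static fields are not captured by the closure.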

In this second approach, the logs will not be written on the driver but on each executor. You would have to collect the logs after the Spark job has finished and combine them into one file.


If you have an untyped dataframe instead of a dataset like above, the same code works. You would just operate directly on a Row object, using its getXXX methods to build the logging output instead of the Data class.
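A minimal sketch of that variant, assuming a Spark version where Encoders.row() is available (older versions can use RowEncoder.apply(schema) instead) and a purely illustrative column called name:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Dataset<Row> df = ... //untyped dataframe
df = df.map((MapFunction<Row, Row>) row -> {
  // Row has no domain-specific toString(), so read the fields explicitly.
  System.out.println("name=" + row.getString(row.fieldIndex("name")));
  return row;
}, Encoders.row(df.schema()));
df.show();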

All logging operations will have an impact on the performance of your code.

Upvotes: 2
