Reputation: 2160
How can I show dataframe content (rows) in Java?
I tried using log.info(df.showString())
, but it prints unreadable characters. I would use df.collectAsList()
, but I still have to filter afterwards, so I thought I can't do that.
Thank you.
Upvotes: 0
Views: 3490
Reputation: 14845
There are several options for logging the data:
You can call collectAsList()
and continue processing afterwards. Collecting the data to the driver will trigger the execution, but since Spark datasets are immutable, you can re-use the dataset afterwards for further processing steps:
Dataset<Data> ds = ... //1
List<Data> collectedDs = ds.collectAsList(); //2
doSomeLogging(collectedDs);
ds = ds.filter(<filter condition>); //3
ds.show();
The code above will collect the data in line //2
, log it, and then continue processing in line //3
.
Depending on how complex the creation of the dataset in line //1
is, you may want to cache the dataset so that the processing in line //1
runs only once:
Dataset<Data> ds = ... //1
ds = ds.cache();
List<Data> collectedDs = ds.collectAsList(); //2
....
Calling collectAsList()
sends all of your data to the driver. Usually you use Spark to distribute the data over several executor nodes, so your driver might not be large enough to hold all of the data at the same time. In this case, you can log the data in a map
call:
Dataset<Data> ds = ... //1
ds = ds.map(d -> {
System.out.println(d); //2
return d; //3
}, Encoders.bean(Data.class));
ds = ds.filter(<filter condition>);
ds.show();
In this example, line //2
does the logging and line //3
simply returns the original object, so that the dataset remains unchanged. This assumes that the Data
class comes with a readable toString()
implementation; otherwise, line //2
needs some more logic. It might also be helpful to use a logging library (like Log4j) instead of writing directly to standard out.
With this second approach, the logs are written not on the driver but on each executor. You would have to collect the logs after the Spark job has finished and combine them into one file.
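As a minimal sketch of that idea (assuming Spark is on the classpath; the class name MapLoggingSketch and the sample string data are made up for illustration), the logger is obtained inside the lambda so that nothing non-serializable is captured by the closure:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class MapLoggingSketch {

    // Builds a tiny local dataset, logs each element via Log4j inside
    // the map call (this runs on the executors), and returns the
    // unchanged contents to show the dataset was not modified.
    static List<String> run() {
        SparkSession spark = SparkSession.builder()
                .appName("map-logging-sketch")
                .master("local[1]") // local mode, just for the sketch
                .getOrCreate();

        Dataset<String> ds = spark.createDataset(
                Arrays.asList("a", "b", "c"), Encoders.STRING());

        ds = ds.map(d -> {
            // Fetch the logger inside the lambda: Log4j loggers are not
            // serializable, so they should not be captured from the driver.
            Logger.getLogger("DataLogger").info(d);
            return d; // pass the element through unchanged
        }, Encoders.STRING());

        List<String> result = ds.collectAsList();
        spark.stop();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Note that each executor writes to its own Log4j appender, so the log lines end up in per-executor files, as described above.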
If you have an untyped dataframe (Dataset<Row>) instead of a typed dataset like above, the same code works. You would just operate directly on the Row object, using its getXXX
methods to build the logging output instead of the Data
class.
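A sketch of the untyped variant (assuming a hypothetical dataframe with "name" and "age" columns; on older Spark versions the Row encoder is obtained via RowEncoder.apply, on Spark 3.5+ Encoders.row(schema) is available instead):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowLoggingSketch {

    // Logs each Row on the executors using the getXXX accessors and
    // returns the rows unchanged.
    static List<Row> run() {
        SparkSession spark = SparkSession.builder()
                .appName("row-logging-sketch")
                .master("local[1]")
                .getOrCreate();

        StructType schema = new StructType()
                .add("name", DataTypes.StringType)
                .add("age", DataTypes.IntegerType);
        Dataset<Row> df = spark.createDataFrame(
                Arrays.asList(RowFactory.create("Alice", 30),
                              RowFactory.create("Bob", 25)),
                schema);

        df = df.map(row -> {
            // Access the columns by position via the getXXX methods.
            System.out.println(row.getString(0) + " is " + row.getInt(1));
            return row; // pass the row through unchanged
        }, RowEncoder.apply(schema));

        List<Row> rows = df.collectAsList();
        spark.stop();
        return rows;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```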
All logging operations will have an impact on the performance of your code.
Upvotes: 2