Mike

Reputation: 593

Visualization of data from a DataFrame in the (Py)Spark framework

A question about visualization methods for Spark DataFrames.

As of now (I am using v2.0.0), Spark DataFrames do not have any built-in visualization functionality (yet). The usual solution is to collect a sample of the DataFrame into the driver, load it into, for instance, a Pandas DataFrame, and use its visualization capabilities.
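As a rough illustration, here is a minimal sketch of that workaround, assuming a SparkSession named spark (available since Spark 2.0); the table events and its columns dt and value are hypothetical names used for illustration only:

# Minimal sketch: sample on the cluster, collect the sample to the driver,
# and plot it there with Pandas. Table/column names are made up.
import matplotlib.pyplot as plt

sample_pdf = (
    spark.table("events")
    # Keep roughly 1% of the rows; the fraction is approximate, and a fixed
    # seed makes the sample reproducible between runs.
    .sample(withReplacement=False, fraction=0.01, seed=42)
    # Collect only the sample into driver memory as a Pandas DataFrame.
    .toPandas()
)
sample_pdf.plot(x="dt", y="value", kind="scatter")
plt.show()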

My question is: how do I know the optimal sample size that makes the most of the driver's memory when visualizing the data? Or, what is the best practice for working around this issue?

Thanks!

Upvotes: 5

Views: 8996

Answers (2)

Ajay Kharade

Reputation: 1525

There is a visualization tool on top of Spark SQL (DataFrames): the Apache Zeppelin notebook, an open-source notebook in which you can see the results rendered in graphical form.

A good thing about this notebook is that it has built-in support for Spark integration, so no configuration effort is required. As far as your other question is concerned, about memory and sampling, Zeppelin handles this out of the box. For more information about Zeppelin's Spark support, refer to this link.
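For instance, a minimal sketch of a Zeppelin paragraph, assuming a hypothetical table events (z is the ZeppelinContext object that Zeppelin injects into Spark paragraphs):

%pyspark
# Hypothetical example: aggregate in Spark, then let Zeppelin render the result.
# z.show() displays a DataFrame as a table with built-in chart controls
# (bar, line, area, pie, ...) that can be switched in the notebook UI.
df = spark.table("events").groupBy("dt").count()
z.show(df)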

Upvotes: 0

Paulius Baranauskas

Reputation: 343

I don't think this will answer your question, but hopefully it will give some perspective to others, or maybe to you.

I usually aggregate in Spark and then use Pandas to visualize (without storing the result in a variable). For example (simplified), I would count active users per day and then collect and visualize only that count through Pandas (when possible, I try to avoid saving data to a variable):

from pyspark.sql import functions as F

(
    spark.table("table_name")
    .filter(F.col("status") == "Active")  # keep only active users
    .groupBy("dt")                        # one group per day
    .count()                              # active-user count per day
    .toPandas()                           # only the per-day counts reach the driver
    .plot(x="dt", y="count")
)
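A nice property of this pattern is that the aggregation runs on the cluster and only the aggregated rows (here, one row per day) are collected to the driver, so the driver-memory concern from the question largely disappears; sampling is only needed when you want to plot raw, unaggregated rows.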

Upvotes: 2
