mmTmmR

Plotly Express Box Plot Produces White Screen When Plotting Using a Large Dataframe

I have the following dataframe with 40M rows:

import numpy as np
import pandas as pd

occ_status_pre = ["retired", "unemployed", "house person", "financially independent", "employed", "student"]

n_rows = 40_000_000
test_df = pd.DataFrame(np.random.randint(0, 100, size=(n_rows, 4)), columns=["id", "occupation_status", "age", "height"])

# Replace the numeric placeholder column with random category labels
# (vectorised, so no 40M-iteration Python loop is needed)
test_df["occupation_status"] = np.random.choice(occ_status_pre, size=n_rows)
test_df.head()
   id occupation_status  age  height
0  32        unemployed   41      78
1  83           retired   35      99
2  77           retired   61      19
3   8      house person   28      64
4   6        unemployed   46      22

In Seaborn, I can successfully create a Box plot for the entire dataframe without any issue:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(x="occupation_status", y="age", data=test_df, ax=ax)
plt.tight_layout()

However, if I try to recreate this same Box plot in Plotly 4.2 then it crashes my web browser.

Further investigation led me to Plotly's renderers setting (pio.renderers, where pio is plotly.io). If I select the "browser" renderer, either per call as below or by setting pio.renderers.default, the Box plot is output to a new browser tab:

import plotly.express as px

fig = px.box(test_df, x="occupation_status", y="age")
fig.show(renderer="browser")

However, if my dataframe has more than 28M rows then this only displays a blank white screen - no visualisation ever appears in the new tab.

Adding more columns to the dataframe did not seem to change this. The row count is what matters: once the dataframe has more than 28M rows, I cannot produce a Box plot from it.

I know that there is render_mode="webgl" for working with larger data, but I can only seem to set that for Scatter and Line plot types.

So my question is, is there a way to produce interactive Box plots in Plotly for large dataframes? (Same question also holds true for Violin plots too.)

If there is not, then what is the limitation preventing the plot from rendering when the row count is greater than 28 million rows?

If this is not possible in Plotly, does anyone know of any alternative tools that I could use to produce Box/Violin plots of big data in Python? For example, would this be possible with ggplot2, or would the same limitation exist there too?

My ultimate aim is to produce nice interactive plots using Plotly and then create Dash dashboards that use these plots.

Many thanks

23/10/19: Additional Testing:

I downgraded Plotly to 3.10.0 and got the same result - no figure is rendered and I am just presented with a white screen. I have now upgraded back up to version 4.2 again.

Additionally, I installed Cufflinks. I followed the process described here to get Cufflinks working with Plotly 4: https://github.com/santosjorge/cufflinks/pull/203

Cufflinks behaviour is almost identical to Plotly Express behaviour. If I let the plot render in the notebook then nothing happens (no crash or error and no output of any kind, but the cell marks itself as run). If I output it to an HTML file, as per the Edit in the accepted answer to "Cufflinks for plotly: setting cufflinks config options launches", then it produces a very large (approx. 1.5 GB) HTML file that again shows up as a white screen when opened.
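
A file of that size would be consistent with every data point being written into the page. The sketch below is only a back-of-envelope check, not a measurement of what Plotly or Cufflinks actually emit: it assumes the x and y columns are serialised as plain JSON text, and the names CATEGORIES and estimate_payload_bytes are made up for illustration. It serialises a small random sample and extrapolates linearly to the full row count.

```python
import json
import random

# Same category labels as in the example dataframe
CATEGORIES = ["retired", "unemployed", "house person",
              "financially independent", "employed", "student"]


def estimate_payload_bytes(n_rows, sample_size=10_000, seed=0):
    """Extrapolate the JSON text size of n_rows (x, y) pairs from a small sample.

    Assumption: each value is written out as plain JSON text, one per row.
    """
    rng = random.Random(seed)
    sample = {
        "x": [rng.choice(CATEGORIES) for _ in range(sample_size)],
        "y": [rng.randint(0, 99) for _ in range(sample_size)],
    }
    bytes_per_row = len(json.dumps(sample)) / sample_size
    return int(bytes_per_row * n_rows)


for n in (1_000_000, 28_000_000, 40_000_000):
    size_gb = estimate_payload_bytes(n) / 1e9
    print(f"{n:>11,} rows: about {size_gb:.2f} GB of JSON text")
```

If the estimate for 40M rows comes out in the same order of magnitude as the HTML file, that would point to the browser struggling with the sheer size of the embedded data rather than with the Box trace as such.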

As this issue seems to be caused by working on a large dataframe, I thought there might be an issue with the Jupyter notebook being unable to handle such a large volume of data. Therefore I tried adjusting the iopub.data_rate as per https://community.plot.ly/t/tips-for-using-plotly-with-jupyter-notebook-5-0-the-latest-version/4156 but it didn't make a difference.

As I am experiencing very similar behaviour in both Plotly Express and Cufflinks, this suggests to me that the issue must be to do with Plotly itself?

Has anyone had any success producing Box or Violin plots for larger datasets?

Answers (1)

mmTmmR

In the end my solution was to move to holoviews.

import holoviews as hv
hv.extension('plotly')
boxwhisker = hv.BoxWhisker(test_df, 'occupation_status', 'age')
boxwhisker

Points to note:

  1. When I used the "bokeh" extension my plot rendered but was not interactive. However, when I used the "plotly" extension, my interactive box plot was successfully produced as per above. This is really interesting, because producing the same plot with Plotly directly still crashes my browser.

  2. For some reason my "occupation status" categories have been truncated to a single letter. I am experimenting with holoviews opts xrotation and xticks but have yet to fix this. This is not the end of the world, however it would be nice to fix.
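
Another approach that may be worth trying, though it was not tested on the 40M-row dataframe above, is to compute the box statistics per category in pandas and hand Plotly only those few numbers. The sketch below assumes a Plotly release newer than the 4.2 used in the question, one in which go.Box accepts precomputed q1, median, q3, lowerfence and upperfence values. The helper names box_stats and box_figure are hypothetical, and the whiskers follow the common 1.5 * IQR convention, which is an assumption rather than something taken from the plots above.

```python
import pandas as pd


def box_stats(df, group_col, value_col):
    """Return one row of box plot statistics per category."""
    rows = []
    for name, values in df.groupby(group_col)[value_col]:
        q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        # Whiskers end at the most extreme observations within 1.5 * IQR of the box
        inside = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
        rows.append({
            group_col: name,
            "q1": q1,
            "median": median,
            "q3": q3,
            "lowerfence": inside.min(),
            "upperfence": inside.max(),
        })
    return pd.DataFrame(rows)


def box_figure(stats, group_col):
    """Build a box trace from precomputed statistics (assumes a recent Plotly)."""
    # Imported here so that the summary step above works without Plotly installed
    import plotly.graph_objects as go

    trace = go.Box(
        x=stats[group_col],
        q1=stats["q1"],
        median=stats["median"],
        q3=stats["q3"],
        lowerfence=stats["lowerfence"],
        upperfence=stats["upperfence"],
    )
    return go.Figure(trace)
```

With the dataframe from the question, this would be called as box_figure(box_stats(test_df, "occupation_status", "age"), "occupation_status") and shown with fig.show(renderer="browser"). The figure then carries six boxes' worth of numbers instead of 40M points. The trade-off is that individual outlier points are not drawn.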
