Reputation: 303

Plotting large datasets as kind=bar ineffective

I am working with a moderately large data set of approximately 100,000 records. When I plot a df column as a line with the code below, the plot takes approximately 2 seconds.

import matplotlib.pyplot as plt

with plt.style.context('ggplot'):
    plt.figure(3, figsize=(16, 12))
    plt.subplot(411)
    df_pca_std['PC1_resid'].plot(title="PC1 Residual", color='r')

    # If I change the plot to a bar (no other change):
    df_X_std['PC1_resid'].plot(kind='bar', title="PC1 Residual", color='r')

With kind='bar' it takes 112 seconds, and the x axis renders as a jumble of overlapping tick labels.

I have suppressed the axis and changed the style, but neither helped. Does anyone have ideas for rendering this more cleanly and in less time? The data is being checked for mean reversion and is better displayed as a bar plot.

Upvotes: 0

Answers (2)

tnf

Reputation: 303

One possible solution: I do not actually need to plot bars. I can use the very fast line plot together with matplotlib's fill_between function to color the area from zero to the line. The effect is similar to plotting all the bars, in a fraction of the time.

Use the to_pydatetime method of DatetimeIndex to convert Date (the df index) to an array of datetime.datetime objects that matplotlib can use, then change the plot.

import matplotlib.dates as mdates  # date2num lives in matplotlib.dates

plotDates = mdates.date2num(df.index.to_pydatetime())

plt.fill_between(plotDates, 0, df_pca_std['PC1_resid'], alpha=0.5)
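For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same idea. The random series, its one-minute DatetimeIndex, the figure size, and the output file name are all assumptions made for illustration; only the to_pydatetime / date2num / fill_between steps come from the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for df_pca_std['PC1_resid'] (assumed: 100,000 rows, DatetimeIndex)
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=100_000, freq="min")
resid = pd.Series(rng.standard_normal(len(index)), index=index, name="PC1_resid")

# Convert the DatetimeIndex to matplotlib's float date numbers
plot_dates = mdates.date2num(resid.index.to_pydatetime())

fig, ax = plt.subplots(figsize=(16, 3))

# One filled polygon between zero and the series, instead of one patch per bar
ax.fill_between(plot_dates, 0, resid.to_numpy(), color="r", alpha=0.5)

ax.axhline(0, color="k", linewidth=0.5)  # zero line, handy when eyeballing mean reversion
ax.xaxis_date()                          # treat x values as dates for tick formatting
ax.set_title("PC1 Residual")

fig.savefig("pc1_resid_fill.png")  # hypothetical output file name
```

The fill is drawn as a single collection, while a bar plot creates one rectangle per row, which is likely where most of the extra time goes.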

Upvotes: 0

tnf

Reputation: 303

Another option is pygal. The charts are not the best visually, but at least they render: it plotted 2.1 million bars in 14.2 seconds.

import pygal

bar_chart = pygal.Bar()
bar_chart.add('PC1_residuals', df_X_std['PC1_resid'])
bar_chart.render_to_file('bar_chart.svg')

Upvotes: 1
